Running a 2-GPU Local LLM (Iona Stack)
What This Actually Is
This page documents a working single-machine local LLM stack built around llama.cpp, Open WebUI, and optional Caddy. The point is to capture the exact shape that worked, the commands that brought it up, and the failure modes that consumed time.
The setup exposes a stable OpenAI-compatible endpoint backed by two GPUs.
Components:
- llama.cpp server (inference)
- Open WebUI (frontend)
- optional Caddy (routing / TLS)
Shape:
- API → :8080
- UI → :3000
Optional hostnames:
- iona.local.arpeggio.one
- iona-api.local.arpeggio.one
The goal is not scale. The goal is control.
Model Choice (Where Reality Bites)
Working baseline:
- Gemma 4 26B A4B
- Q4 quantization
- multimodal projector (mmproj)
- ~128K context target
This is not arbitrary.
31B-class models were tested, including runs with mmproj, but on dual 2080 Ti the usable headroom narrowed too much once long context and KV cache were included in the budget. The problem was not benchmark performance in the abstract. The problem was operational headroom during real sessions.
26B remained the largest configuration that stayed comfortable enough to use repeatedly without turning every longer run into a fit problem.
The limiting factor was not just raw VRAM on paper. It was the combined effect of KV cache growth, context length, and the safety margin needed to keep the system usable instead of merely bootable.
Dual GPU Split
Required flag:
--split-mode layer
What this does:
- divides layers across both GPUs
- shares VRAM pressure
- introduces PCIe as the limiting factor
The split works by distributing layers across both cards. That buys enough room for the model, but it does not erase PCIe traffic or synchronization costs.
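A quick way to confirm the split is actually balanced is to watch per-card memory while a long prompt is being processed. This is standard NVIDIA tooling, nothing specific to this stack:
watch -n 2 nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader
If one card sits near its ceiling while the other has gigabytes free, the layer distribution is worth revisiting before touching any other flag.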
Consequences:
- cross-device latency exists
- batching makes those costs more visible
- higher concurrency increases coordination overhead between cards
The setup stays tractable because the moving parts are few: one inference container, one UI container, optional reverse proxy, fixed paths for model files, and a small set of known-good runtime flags.
Concurrency Trap
Setting:
--parallel 1
This is the one most people try to "improve".
Increasing it causes:
- KV cache fragmentation
- reduced usable context
- unstable latency
Keeping it at 1:
- preserves context
- keeps memory predictable
- avoids fragmentation-induced failures
Throughput tuning here trades away the only thing that matters: stability.
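A rough way to see the cost, assuming the usual llama.cpp server behavior of dividing the configured context across parallel slots:
CTX=131072
for P in 1 2 4; do
    echo "--parallel $P -> ~$(( CTX / P )) tokens of context per slot"
done
At --parallel 4, the 128K window this setup is built around shrinks to roughly 32K per slot before fragmentation is even considered.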
Flags That Actually Matter
Representative configuration:
--ctx-size 131072 --cache-type-k q4_0 --cache-type-v q4_0 --batch-size 1024 --ubatch-size 256 --fit on --fit-target 2048,1024
Interpretation:
- ctx-size → maximum working memory window
- KV quantization → makes large context possible
- batch / ubatch → throughput vs stability balance
- fit / fit-target → prevents OOM under pressure
If KV cache is not controlled, everything else is irrelevant.
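For intuition, a back-of-the-envelope KV cache estimate. The layer and head counts below are placeholders, not the real figures for this model; substitute the dimensions the server prints at startup:
# Hypothetical dimensions -- replace with the values your model actually reports
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; CTX=131072
BITS=5   # q4_0 is roughly 4.5 bits per element once block scales are counted
echo "approx KV cache: $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * BITS / 8 / 1024 / 1024 )) MiB"
Doubling the context doubles that number, which is why KV quantization is what makes a 128K target feasible at all on cards of this age.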
API Layer (The Real Lever)
Endpoint:
http://127.x.x.x:8080/v1
Implications:
- any OpenAI-compatible client works
- existing tools integrate without modification
- local inference becomes infrastructure, not a toy
The model file is replaceable. The more durable value is the harness around it: the OpenAI-compatible /v1 surface, stable model aliases, predictable routing, and a UI that can be swapped or upgraded without redesigning the rest of the stack.
In SOLID terms, this is close to the value of a stable substitution boundary. If a component honors the same contract, the rest of the system can keep working while you change the implementation behind it.
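A minimal smoke test against that surface, using the model alias and key file defined later on this page:
curl -s http://127.x.x.x:8080/v1/chat/completions \
-H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" \
-H "Content-Type: application/json" \
-d '{"model": "gemma-4-26b-a4b", "messages": [{"role": "user", "content": "Reply with one word: ready"}]}'
Any client that can speak this format plugs in the same way, whether it is an editor plugin, a small script, or another service on the box.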
What Breaks First
Observed failure modes:
- KV cache exhaustion under long sessions
- context degradation masked as "hallucination"
- GPU imbalance under uneven layer distribution
- latency spikes from cross-GPU sync
- WebUI mismatch (wrong endpoint or port mapping)
Most stability problems in this stack come from memory pressure rather than from the base model itself. The symptoms vary, but the underlying causes usually reduce to KV cache fit, context size, batching, split balance, or a model choice that leaves too little headroom for normal use.
Operational Reality
This stack does not behave like cloud inference.
It behaves like a constrained system:
- predictable when respected
- unstable when pushed past limits
You are trading elasticity for control.
That trade only works if you accept the constraints as first-class.
Why This Setup Exists At All
The practical win is straightforward.
On hardware that is now nearly a decade old, this stack can still cover a large share of the work many people actually do day to day: coding, shell work, log reading, config review, operational debugging, and structured writing grounded in supplied context. For that class of work, the result is not theoretical. It is a usable local system with good throughput, low marginal cost, and full control over prompts, files, and iteration loops.
The stack is useful because it combines:
- good enough model quality for real technical work
- local control over models and data
- a stable OpenAI-compatible harness that other tools can target
- no upload requirement for local files and images
- fast local iteration once the model is warm
Practical comparison: local Gemma 4 26B A4B vs GPT-4o
This is an operator estimate from day-to-day use on this stack, not a benchmark claim.
| Dimension | Local Gemma 4 26B A4B | GPT-4o | Notes |
|---|---|---|---|
| Coding and shell help | 80–90% | 100% | Good for implementation, refactors, shell commands, and code reading when the task is bounded and the context is prepared well. |
| Operational reasoning | 80–90% | 100% | Good for logs, service wiring, config review, and debugging workflows. The gap shows up more on broad, ambiguous, cross-domain problems. |
| Long-form structured writing | 75–85% | 100% | Solid when the structure and source material are provided. Less consistent on nuance, tone control, and longer argumentative flow. |
| Vision / multimodal | 60–75% | 100% | mmproj makes image support practical, but the ceiling is lower on harder image interpretation and mixed-modality work. |
| Broad world knowledge without supplied context | 65–80% | 100% | Local Gemma performs best when grounded in provided material. The gap widens when the task depends more on broad priors. |
| Response speed once warm | 100–130% | 100% | Around 90 tok/s locally is often faster than the perceived response rate of hosted chat systems for text generation. |
| Privacy / control | 130–150% | 100% | Local execution avoids sending prompts, files, and images off the box. |
| Marginal usage cost after setup | 140–160% | 100% | Once the hardware and stack are in place, repeated use is cheap enough to encourage experimentation without token budgeting. |
A practical summary is that local Gemma 4 26B lands in roughly the 80 to 90 percent band for many coding and operator-style tasks that fit the stack well, while also outperforming hosted usage on some dimensions that matter operationally: response speed after warmup, privacy, image locality, and marginal cost.
Boundaries
This stack is best treated as:
- a personal or small-team local inference layer
- a reusable OpenAI-compatible endpoint
- a controlled environment for technical and operational work
It is not optimized here for:
- multi-tenant serving
- horizontal scaling
- large shared hosted workloads
Those are different system goals and usually require different tradeoffs.
Bottom Line
This is a practical local inference layer built on old but still capable hardware.
It works well when:
- model size matches the available headroom
- concurrency stays modest
- memory is budgeted explicitly
- routing stays deterministic through fixed ports and clean hostnames
When those conditions are respected, the result is not a toy. It is a private OpenAI-compatible endpoint with long-context experimentation, multimodal support, and a harness you can keep improving as local models continue to get better.
Model Acquisition and Pointers
Download model + projector
python3 - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/google_gemma-4-26B-A4B-it-GGUF",
    allow_patterns=[
        "*Q4_0.gguf",
        "mmproj-google_gemma-4-26B-A4B-it-f16.gguf",
    ],
    local_dir="models/gemma4-26b",
    local_dir_use_symlinks=False,
)
PY
Stable pointers (container reads fixed paths)
cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf
Rules:
- keep symlinks relative
- targets must exist under the bind-mounted /models tree
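A quick check that the pointers resolve to real files on the host side of the bind mount:
cd ~/local-ai/gemma4/models || exit 1
readlink model.gguf mmproj.gguf
test -f "$(readlink -f model.gguf)" && test -f "$(readlink -f mmproj.gguf)" && echo "pointers OK"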
Multimodal (Image) Support
Both files are required at runtime:
-m /models/model.gguf
--mmproj /models/mmproj.gguf
Sanity checks:
ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf
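Beyond confirming the files exist, the quickest end-to-end check is an image request through the same OpenAI-compatible endpoint. A minimal sketch, assuming a small test image at ./test.png and the same key file used elsewhere on this page:
IMG_B64=$(base64 -w0 ./test.png)
curl -s http://127.x.x.x:8080/v1/chat/completions \
-H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" \
-H "Content-Type: application/json" \
-d "{
  \"model\": \"gemma-4-26b-a4b\",
  \"messages\": [{
    \"role\": \"user\",
    \"content\": [
      {\"type\": \"text\", \"text\": \"Describe this image in one sentence.\"},
      {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,${IMG_B64}\"}}
    ]
  }]
}"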
Compose (API Server)
services:
  gemma:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: gemma-local
    restart: unless-stopped
    gpus: all
    ports:
      - "127.x.x.x:8080:8080"
    env_file:
      - .env
    volumes:
      - ./models:/models:ro
    command: >
      -m /models/model.gguf
      --mmproj /models/mmproj.gguf
      --alias gemma-4-26b-a4b
      --host 0.0.0.0
      --port 8080
      --api-key ${LLM_API_KEY}
      --ctx-size 131072
      --parallel 1
      --cache-type-k q4_0
      --cache-type-v q4_0
      --split-mode layer
      --fit on
      --fit-target 2048,1024
      --batch-size 1024
      --ubatch-size 256
Bring up + logs:
cd ~/local-ai/gemma4 || exit 1
docker compose up -d
docker compose logs -f gemma
Health checks:
curl -s http://127.x.x.x:8080/health && echo
curl -s http://127.x.x.x:8080/v1/models \
-H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo
Open WebUI (separate container)
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 127.x.x.x:3000:8080 \
-v open-webui_open-webui:/app/backend/data \
-e WEBUI_SECRET_KEY="$(cat ~/.config/iona/open-webui.secret)" \
--add-host=host.docker.internal:host-gateway \
--add-host=iona-api.local.arpeggio.one:host-gateway \
--add-host=iona.local.arpeggio.one:host-gateway \
ghcr.io/open-webui/open-webui:v0.9.1
Confirm runtime:
docker ps --format 'table {{.Names}} {{.Image}} {{.Ports}} {{.Status}}'
Provider URL inside WebUI:
http://iona-api.local.arpeggio.one/v1
If models do not appear, add gemma-4-26b-a4b under the Model IDs filter.
WebUI Upgrade and State Safety
Back up data volume:
mkdir -p ~/backups
docker run --rm \
-v open-webui_open-webui:/source:ro \
-v ~/backups:/backup \
alpine \
sh -c 'cd /source && tar czf /backup/open-webui-$(date +%Y%m%d-%H%M%S).tar.gz .'
Capture existing key:
SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"
Recreate:
docker pull ghcr.io/open-webui/open-webui:v0.9.1
docker rm -f open-webui
# re-run with same flags + preserved SECRET_KEY
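Spelled out, the re-run is the same docker run as in the Open WebUI section above, with the captured key substituted in:
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 127.x.x.x:3000:8080 \
-v open-webui_open-webui:/app/backend/data \
-e WEBUI_SECRET_KEY="$SECRET_KEY" \
--add-host=host.docker.internal:host-gateway \
--add-host=iona-api.local.arpeggio.one:host-gateway \
--add-host=iona.local.arpeggio.one:host-gateway \
ghcr.io/open-webui/open-webui:v0.9.1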
If container is gone:
- restore volume from backup
- recover key from shell history
- last resort: new key (sessions break)
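A restore sketch for the volume step, assuming one of the tarballs produced by the backup command above (substitute a real timestamp for the placeholder):
docker volume create open-webui_open-webui
docker run --rm \
-v open-webui_open-webui:/target \
-v ~/backups:/backup:ro \
alpine \
sh -c 'cd /target && tar xzf /backup/open-webui-YYYYMMDD-HHMMSS.tar.gz'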
Caddy (full shape)
http://iona.local.arpeggio.one {
    reverse_proxy 127.x.x.x:3000
}

https://iona.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:3000
}

http://iona-api.local.arpeggio.one {
    reverse_proxy 127.x.x.x:8080
}

https://iona-api.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:8080
}
Validate + reload:
sudo caddy fmt --overwrite /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy
Socket checks:
sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'
Resolve checks:
curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/
Hotspot + DNS (optional but part of system)
Start hotspot:
sudo nmcli dev wifi hotspot \
ifname wl**********c6 \
ssid MoonshotBridge \
password "YOUR_PASSWORD"
Inject DNS:
sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF
Verify:
dig @10.x.x.x iona.local.arpeggio.one
systemd Wrapper (optional)
[Unit]
Description=Iona local AI containers
Requires=docker.service
After=docker.service network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker start open-webui gemma-local
ExecStop=/usr/bin/docker stop open-webui gemma-local
[Install]
WantedBy=multi-user.target
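Save the unit above as /etc/systemd/system/iona-ai.service (the unit name is only a suggestion; nothing else on this page depends on it), then enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now iona-ai.service
systemctl status iona-ai.service --no-pager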
Stop / Restart
docker stop open-webui gemma-local
docker start open-webui gemma-local
cd ~/local-ai/gemma4
docker compose stop gemma
docker compose up -d
sudo systemctl restart caddy
Operational Checklist
docker ps --format 'table {{.Names}} {{.Status}} {{.Ports}}'
sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'
getent hosts iona.local.arpeggio.one iona-api.local.arpeggio.one
curl -s http://127.x.x.x:8080/health
Practical Boundaries
Chosen:
- 26B daily driver
- local-only routing
- manual memory fallback
Deferred:
- 31B daily driver
- public exposure
- autonomous memory reliance
TSG (Troubleshooting Guide)
Models do not appear in Open WebUI
Symptom
- the provider connection looks valid, but the model list is empty
Likely causes
- provider URL is wrong inside the container
- /v1 was omitted
- saved provider state in WebUI is stale
- the model ID needs to be surfaced manually
Checks
curl -s http://127.x.x.x:8080/v1/models \
-H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo
Inside WebUI, verify the provider URL is:
http://iona-api.local.arpeggio.one/v1
Recovery
- fix the provider URL
- recreate the provider entry if the saved state looks stale
- add gemma-4-26b-a4b under the Model IDs filter if discovery still fails
Open WebUI shows a network error when calling the model
Symptom
- UI loads, but requests to the local provider fail with a network error
Likely causes
- the container cannot reach the host-side API
- localhost or 127.x.x.x was used inside the WebUI container
- host-gateway mappings were not added at container start
Checks
Inspect the container launch flags and confirm these host mappings exist:
--add-host=host.docker.internal:host-gateway
--add-host=iona-api.local.arpeggio.one:host-gateway
--add-host=iona.local.arpeggio.one:host-gateway
Recovery
- use http://iona-api.local.arpeggio.one/v1 in WebUI, not host loopback
- recreate the WebUI container with the host-gateway mappings if they were omitted
KV allocation or buffer allocation fails
Symptom
- failed to allocate buffer for kv cache
- model starts inconsistently or fails during larger runs
Likely causes
- context is set too high for the model + cache combination
- the chosen model leaves too little VRAM headroom
- batching is too aggressive for the available fit
Checks
Review the runtime flags in use:
- --ctx-size
- --cache-type-k
- --cache-type-v
- --batch-size
- --ubatch-size
- --fit-target
Recovery
- step down to the 26B configuration if you drifted upward
- reduce context before changing many things at once
- keep KV cache quantized
- avoid simultaneous changes to model size, context, and batching because that makes failures harder to localize
GGUF file cannot be opened
Symptom
failed to open GGUF file '/models/model.gguf'
Likely causes
- the symlink points outside the bind-mounted /models tree
- the symlink is absolute when it should be relative
- the target file was renamed or moved
Checks
ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf
Recovery
Recreate the stable pointers as relative symlinks:
cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf
Browser warns on HTTPS or local names fail to resolve
Symptom
- browser warns on https://iona.local.arpeggio.one
- the router box gets NXDOMAIN while hotspot clients work
Likely causes
- Caddy internal CA is not yet trusted on the client
- the host machine is using its own resolver instead of the hotspot DNS responder
Checks
curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/
dig @10.x.x.x iona.local.arpeggio.one
dig @10.x.x.x iona-api.local.arpeggio.one
Recovery
- trust Caddy's internal CA on the client, or use HTTP temporarily
- if the host itself cannot resolve the names, add a host-only fallback:
echo '10.x.x.x iona.local.arpeggio.one iona-api.local.arpeggio.one' | sudo tee -a /etc/hosts
Open WebUI upgrade breaks sessions or stored secrets
Symptom
- users are logged out after recreation
- saved tokens or secrets no longer decrypt correctly
Likely causes
- WEBUI_SECRET_KEY changed during container recreation
- the data volume was recreated without a matching key
Checks
Capture the existing key before deleting the container:
SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"
[ -n "$SECRET_KEY" ] || { echo "WEBUI_SECRET_KEY not found"; exit 1; }
Inspect the actual mount in use:
docker inspect open-webui --format '{{range .Mounts}}{{println .Type .Name .Source "->" .Destination}}{{end}}'
Recovery
- preserve the original WEBUI_SECRET_KEY when recreating the container
- back up the data volume before upgrade
- if the container is already gone, restore the volume first and recover the key from container env or shell history if possible
Hotspot comes back but local names do not work for clients
Symptom
- clients join the hotspot, but iona.local.arpeggio.one does not resolve
Likely causes
- the dnsmasq shared config was not written or not reloaded
- NetworkManager was restarted but the hotspot profile was not brought back up
Checks
sudo grep -R "iona.local.arpeggio.one" /etc/NetworkManager/dnsmasq-shared.d
dig @10.x.x.x iona.local.arpeggio.one
Recovery
sudo systemctl restart NetworkManager
sudo nmcli connection up "Hotspot"
sudo nmcli connection up "Hotspot-2"
If the config is missing, recreate it:
sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF
Trivia: Why "Iona" and "Arpeggio" fit
The names are a small nod to Arpeggio of Blue Steel, where Iona is one of the central mental models and the title itself carries the maritime frame.
That makes the pairing unusually apt here.
- Arpeggio already fits the broader project identity: structured progression, navigation, cadence, and movement across a chart rather than a static point.
- Iona fits the local stack: compact, capable, calm under normal operation, and most useful when treated as part of a vessel's working system rather than as magic.
There is also a mild joke in it. In the anime, Iona is a warship avatar. Here, "Iona" ended up as a local model stack running on aging GPUs, reverse proxies, DNS glue, and enough shell commands to annoy a normal person. The name overshoots the hardware a little, which is part of the charm.