general 2026-04-23 · Updated 2026-04-23

Running a 2-GPU Local LLM (Iona Stack)

iona

What This Actually Is

This page documents a working single-machine local LLM stack built around llama.cpp, Open WebUI, and optional Caddy. The point is to capture the exact shape that worked, the commands that brought it up, and the failure modes that consumed time.

The setup exposes a stable OpenAI-compatible endpoint backed by two GPUs.

Components:

  • llama.cpp server (inference)
  • Open WebUI (frontend)
  • optional Caddy (routing / TLS)

Shape:

  • API β†’ :8080
  • UI β†’ :3000

Optional hostnames:

  • iona.local.arpeggio.one
  • iona-api.local.arpeggio.one

The goal is not scale. The goal is control.


Model Choice (Where Reality Bites)

Working baseline:

  • Gemma 4 26B A4B
  • Q4 quantization
  • multimodal projector (mmproj)
  • ~128K context target

This is not arbitrary.

31B-class models were tested, including runs with mmproj, but on dual 2080 Ti the usable headroom narrowed too much once long context and KV cache were included in the budget. The problem was not benchmark performance in the abstract. The problem was operational headroom during real sessions.

26B remained the largest configuration that stayed comfortable enough to use repeatedly without turning every longer run into a fit problem.

The limiting factor was not just raw VRAM on paper. It was the combined effect of KV cache growth, context length, and the safety margin needed to keep the system usable instead of merely bootable.


Dual GPU Split

Required flag:

--split-mode layer

What this does:

  • divides layers across both GPUs
  • shares VRAM pressure
  • introduces PCIe as the limiting factor

Distributing layers across both cards buys enough room for the model, but it does not erase PCIe traffic or synchronization costs.

Consequences:

  • cross-device latency exists
  • batching makes those costs more visible
  • higher concurrency increases coordination overhead between cards

The setup stays tractable because the moving parts are few: one inference container, one UI container, optional reverse proxy, fixed paths for model files, and a small set of known-good runtime flags.
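
A quick way to confirm the split actually lands evenly is to watch per-GPU memory and utilization while the model loads and during a long generation (standard nvidia-smi query fields, nothing stack-specific):

watch -n 2 'nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv'

Both cards should settle into a similar memory band once loading finishes; a large gap usually means the layer distribution is lopsided.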


Concurrency Trap

Setting:

--parallel 1

This is the one most people try to "improve".

Increasing it causes:

  • KV cache fragmentation
  • reduced usable context
  • unstable latency

Keeping it at 1:

  • preserves context
  • keeps memory predictable
  • avoids fragmentation-induced failures

Throughput tuning here trades away the only thing that matters: stability.
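
The context cost is mechanical rather than incidental: in llama.cpp's server, the --ctx-size budget is shared across the parallel slots, so every extra slot directly shrinks the per-request window.

  --parallel 1  →  131072 tokens available per request
  --parallel 2  →   65536 tokens per request (131072 / 2)
  --parallel 4  →   32768 tokens per request (131072 / 4)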


Flags That Actually Matter

Representative configuration:

--ctx-size 131072 --cache-type-k q4_0 --cache-type-v q4_0 --batch-size 1024 --ubatch-size 256 --fit on --fit-target 2048,1024

Interpretation:

  • ctx-size β†’ maximum working memory window
  • KV quantization β†’ makes large context possible
  • batch / ubatch β†’ throughput vs stability balance
  • fit / fit-target β†’ prevents OOM under pressure

If KV cache is not controlled, everything else is irrelevant.
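
A back-of-envelope check makes the KV point concrete. The architecture numbers below are illustrative assumptions, not the exact dimensions of this model, but the shape of the math is what matters:

  per-token KV values ≈ 2 (K+V) × layers × kv_heads × head_dim
                      = 2 × 48 × 8 × 128              ≈ 98K values per token
  f16 KV  at 131072 ctx ≈ 98K × 131072 × 2.0 bytes    ≈ 24 GiB
  q4_0 KV at 131072 ctx ≈ 98K × 131072 × 0.56 bytes   ≈ 6.8 GiB

Against two 11 GiB cards that also have to hold the Q4 weights, the f16 figure never fits; the quantized figure is what makes the 128K target plausible at all.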


API Layer (The Real Lever)

Endpoint:

http://127.x.x.x:8080/v1

Implications:

  • any OpenAI-compatible client works
  • existing tools integrate without modification
  • local inference becomes infrastructure, not a toy

The model file is replaceable. The more durable value is the harness around it: the OpenAI-compatible /v1 surface, stable model aliases, predictable routing, and a UI that can be swapped or upgraded without redesigning the rest of the stack.

In SOLID terms, this is close to the value of a stable substitution boundary. If a component honors the same contract, the rest of the system can keep working while you change the implementation behind it.
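
Because the surface is plain OpenAI-compatible HTTP, a raw request needs nothing beyond curl. The alias and key path below are the ones configured in the compose file later in this page:

curl -s http://127.x.x.x:8080/v1/chat/completions \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-26b-a4b",
        "messages": [{"role": "user", "content": "Explain what --split-mode layer does in one paragraph."}]
      }'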


What Breaks First

Observed failure modes:

  • KV cache exhaustion under long sessions
  • context degradation masked as "hallucination"
  • GPU imbalance under uneven layer distribution
  • latency spikes from cross-GPU sync
  • WebUI mismatch (wrong endpoint or port mapping)

Most stability problems in this stack come from memory pressure rather than from the base model itself. The symptoms vary, but the underlying causes usually reduce to KV cache fit, context size, batching, split balance, or a model choice that leaves too little headroom for normal use.


Operational Reality

This stack does not behave like cloud inference.

It behaves like a constrained system:

  • predictable when respected
  • unstable when pushed past limits

You are trading elasticity for control.

That trade only works if you accept the constraints as first-class.


Why This Setup Exists At All

The practical win is straightforward.

On hardware that is now nearly a decade old, this stack can still cover a large share of the work many people actually do day to day: coding, shell work, log reading, config review, operational debugging, and structured writing grounded in supplied context. For that class of work, the result is not theoretical. It is a usable local system with good throughput, low marginal cost, and full control over prompts, files, and iteration loops.

The stack is useful because it combines:

  • good enough model quality for real technical work
  • local control over models and data
  • a stable OpenAI-compatible harness that other tools can target
  • no upload requirement for local files and images
  • fast local iteration once the model is warm

Practical comparison: local Gemma 4 26B A4B vs GPT-4o

This is an operator estimate from day-to-day use on this stack, not a benchmark claim.

Dimension | Local Gemma 4 26B A4B | GPT-4o | Notes
Coding and shell help | 80–90% | 100% | Good for implementation, refactors, shell commands, and code reading when the task is bounded and the context is prepared well.
Operational reasoning | 80–90% | 100% | Good for logs, service wiring, config review, and debugging workflows. The gap shows up more on broad, ambiguous, cross-domain problems.
Long-form structured writing | 75–85% | 100% | Solid when the structure and source material are provided. Less consistent on nuance, tone control, and longer argumentative flow.
Vision / multimodal | 60–75% | 100% | mmproj makes image support practical, but the ceiling is lower on harder image interpretation and mixed-modality work.
Broad world knowledge without supplied context | 65–80% | 100% | Local Gemma performs best when grounded in provided material. The gap widens when the task depends more on broad priors.
Response speed once warm | 100–130% | 100% | Around 90 tok/s locally is often faster than the perceived response rate of hosted chat systems for text generation.
Privacy / control | 130–150% | 100% | Local execution avoids sending prompts, files, and images off the box.
Marginal usage cost after setup | 140–160% | 100% | Once the hardware and stack are in place, repeated use is cheap enough to encourage experimentation without token budgeting.

A practical summary is that local Gemma 4 26B lands in roughly the 80 to 90 percent band for many coding and operator-style tasks that fit the stack well, while also outperforming hosted usage on some dimensions that matter operationally: response speed after warmup, privacy, image locality, and marginal cost.

Boundaries

This stack is best treated as:

  • a personal or small-team local inference layer
  • a reusable OpenAI-compatible endpoint
  • a controlled environment for technical and operational work

It is not optimized here for:

  • multi-tenant serving
  • horizontal scaling
  • large shared hosted workloads

Those are different system goals and usually require different tradeoffs.

Bottom Line

This is a practical local inference layer built on old but still capable hardware.

It works well when:

  1. model size matches the available headroom
  2. concurrency stays modest
  3. memory is budgeted explicitly
  4. routing stays deterministic through fixed ports and clean hostnames

When those conditions are respected, the result is not a toy. It is a private OpenAI-compatible endpoint with long-context experimentation, multimodal support, and a harness you can keep improving as local models continue to get better.


Model Acquisition and Pointers

Download model + projector

python3 - <<'PY'
from huggingface_hub import snapshot_download

# Pull only the Q4_0 weights and the f16 multimodal projector.
snapshot_download(
    repo_id="bartowski/google_gemma-4-26B-A4B-it-GGUF",
    allow_patterns=[
        "*Q4_0.gguf",
        "mmproj-google_gemma-4-26B-A4B-it-f16.gguf",
    ],
    local_dir="models/gemma4-26b",
    local_dir_use_symlinks=False,  # write real files into local_dir, not cache symlinks
)
PY

Stable pointers (container reads fixed paths)

cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf

Rules:

  • keep symlinks relative
  • targets must exist under the bind-mounted /models tree

Multimodal (Image) Support

Both files are required at runtime:

-m /models/model.gguf
--mmproj /models/mmproj.gguf

Sanity checks:

ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf
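
ls -lah on a symlink only shows the link itself; adding -L follows it, so a broken pointer fails loudly here instead of at container start:

ls -lahL ~/local-ai/gemma4/models/model.gguf ~/local-ai/gemma4/models/mmproj.gguf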

Compose (API Server)

services:
  gemma:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: gemma-local
    restart: unless-stopped
    gpus: all
    ports:
      - "127.x.x.x:8080:8080"
    env_file:
      - .env
    volumes:
      - ./models:/models:ro
    command: >
      -m /models/model.gguf
      --mmproj /models/mmproj.gguf
      --alias gemma-4-26b-a4b
      --host 0.0.0.0
      --port 8080
      --api-key ${LLM_API_KEY}
      --ctx-size 131072
      --parallel 1
      --cache-type-k q4_0
      --cache-type-v q4_0
      --split-mode layer
      --fit on
      --fit-target 2048,1024
      --batch-size 1024
      --ubatch-size 256
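
The compose file reads LLM_API_KEY from .env, while the health checks below read ~/local-ai/gemma4/.api_key. One way to keep the two in sync, a sketch using the same paths already used on this page:

cd ~/local-ai/gemma4 || exit 1
umask 077
openssl rand -hex 32 > .api_key
printf 'LLM_API_KEY=%s\n' "$(cat .api_key)" > .env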

Bring up + logs:

cd ~/local-ai/gemma4 || exit 1
docker compose up -d
docker compose logs -f gemma

Health checks:

curl -s http://127.x.x.x:8080/health && echo
curl -s http://127.x.x.x:8080/v1/models \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo

Open WebUI (separate container)

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 127.x.x.x:3000:8080 \
  -v open-webui_open-webui:/app/backend/data \
  -e WEBUI_SECRET_KEY="$(cat ~/.config/iona/open-webui.secret)" \
  --add-host=host.docker.internal:host-gateway \
  --add-host=iona-api.local.arpeggio.one:host-gateway \
  --add-host=iona.local.arpeggio.one:host-gateway \
  ghcr.io/open-webui/open-webui:v0.9.1
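
The run above reads its secret from ~/.config/iona/open-webui.secret, so that file has to exist before the first launch. One way to create it:

mkdir -p ~/.config/iona
umask 077
openssl rand -hex 32 > ~/.config/iona/open-webui.secret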

Confirm runtime:

docker ps --format 'table {{.Names}}	{{.Image}}	{{.Ports}}	{{.Status}}'

Provider URL inside WebUI:

http://iona-api.local.arpeggio.one/v1

If models do not appear, add:

  • gemma-4-26b-a4b under Model IDs filter

WebUI Upgrade and State Safety

Back up data volume:

mkdir -p ~/backups

docker run --rm \
  -v open-webui_open-webui:/source:ro \
  -v ~/backups:/backup \
  alpine \
  sh -c 'cd /source && tar czf /backup/open-webui-$(date +%Y%m%d-%H%M%S).tar.gz .'

Capture existing key:

SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"

Recreate:

docker pull ghcr.io/open-webui/open-webui:v0.9.1
docker rm -f open-webui
# re-run with same flags + preserved SECRET_KEY
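
Spelled out, the re-run is the original launch command with the captured key substituted in:

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 127.x.x.x:3000:8080 \
  -v open-webui_open-webui:/app/backend/data \
  -e WEBUI_SECRET_KEY="$SECRET_KEY" \
  --add-host=host.docker.internal:host-gateway \
  --add-host=iona-api.local.arpeggio.one:host-gateway \
  --add-host=iona.local.arpeggio.one:host-gateway \
  ghcr.io/open-webui/open-webui:v0.9.1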

If container is gone:

  • restore volume from backup (command below)
  • recover key from shell history
  • last resort: new key (sessions break)
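
Restoring the volume mirrors the backup command; the archive name below is a placeholder for whichever timestamped file you want to roll back to:

docker run --rm \
  -v open-webui_open-webui:/restore \
  -v ~/backups:/backup:ro \
  alpine \
  sh -c 'cd /restore && tar xzf /backup/open-webui-YYYYMMDD-HHMMSS.tar.gz'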

Caddy (full shape)

http://iona.local.arpeggio.one {
    reverse_proxy 127.x.x.x:3000
}

https://iona.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:3000
}

http://iona-api.local.arpeggio.one {
    reverse_proxy 127.x.x.x:8080
}

https://iona-api.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:8080
}

Validate + reload:

sudo caddy fmt --overwrite /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

Socket checks:

sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'

Resolve checks:

curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/

Hotspot + DNS (optional but part of system)

Start hotspot:

sudo nmcli dev wifi hotspot \
  ifname wl**********c6 \
  ssid MoonshotBridge \
  password "YOUR_PASSWORD"

Inject DNS:

sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF

Verify:

dig @10.x.x.x iona.local.arpeggio.one

systemd Wrapper (optional)

[Unit]
Description=Iona local AI containers
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker start open-webui gemma-local
ExecStop=/usr/bin/docker stop open-webui gemma-local

[Install]
WantedBy=multi-user.target
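
Assuming the unit above is saved under /etc/systemd/system (the file name iona.service here is arbitrary), enabling it is the usual pair of commands:

sudo systemctl daemon-reload
sudo systemctl enable --now iona.service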

Stop / Restart

docker stop open-webui gemma-local
docker start open-webui gemma-local
cd ~/local-ai/gemma4
docker compose stop gemma
docker compose up -d
sudo systemctl restart caddy

Operational Checklist

docker ps --format 'table {{.Names}}	{{.Status}}	{{.Ports}}'
sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'
getent hosts iona.local.arpeggio.one iona-api.local.arpeggio.one
curl -s http://127.x.x.x:8080/health

Practical Boundaries

Chosen:

  • 26B daily driver
  • local-only routing
  • manual memory fallback

Deferred:

  • 31B daily driver
  • public exposure
  • autonomous memory reliance

TSG (Troubleshooting Guide)

Models do not appear in Open WebUI

Symptom

  • the provider connection looks valid, but the model list is empty

Likely causes

  • provider URL is wrong inside the container
  • /v1 was omitted
  • saved provider state in WebUI is stale
  • the model ID needs to be surfaced manually

Checks

curl -s http://127.x.x.x:8080/v1/models \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo

Inside WebUI, verify the provider URL is:

http://iona-api.local.arpeggio.one/v1

Recovery

  • fix the provider URL
  • recreate the provider entry if the saved state looks stale
  • add gemma-4-26b-a4b under Model IDs filter if discovery still fails

Open WebUI shows a network error when calling the model

Symptom

  • UI loads, but requests to the local provider fail with a network error

Likely causes

  • the container cannot reach the host-side API
  • localhost or 127.x.x.x was used inside the WebUI container
  • host-gateway mappings were not added at container start

Checks

Inspect the container launch flags and confirm these host mappings exist:

--add-host=host.docker.internal:host-gateway
--add-host=iona-api.local.arpeggio.one:host-gateway
--add-host=iona.local.arpeggio.one:host-gateway
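
One way to confirm the mappings on the running container without re-reading shell history; HostConfig.ExtraHosts holds the --add-host entries:

docker inspect open-webui --format '{{.HostConfig.ExtraHosts}}'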

Recovery

  • use http://iona-api.local.arpeggio.one/v1 in WebUI, not host loopback
  • recreate the WebUI container with the host-gateway mappings if they were omitted

KV allocation or buffer allocation fails

Symptom

  • failed to allocate buffer for kv cache
  • model starts inconsistently or fails during larger runs

Likely causes

  • context is set too high for the model + cache combination
  • the chosen model leaves too little VRAM headroom
  • batching is too aggressive for the available fit

Checks

Review the runtime flags in use:

  • --ctx-size
  • --cache-type-k
  • --cache-type-v
  • --batch-size
  • --ubatch-size
  • --fit-target

Recovery

  • step down to the 26B configuration if you drifted upward
  • reduce context before changing many things at once
  • keep KV cache quantized
  • avoid simultaneous changes to model size, context, and batching because that makes failures harder to localize

GGUF file cannot be opened

Symptom

  • failed to open GGUF file '/models/model.gguf'

Likely causes

  • the symlink points outside the bind-mounted /models tree
  • the symlink is absolute when it should be relative
  • the target file was renamed or moved

Checks

ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf

Recovery

Recreate the stable pointers as relative symlinks:

cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf

Browser warns on HTTPS or local names fail to resolve

Symptom

  • browser warns on https://iona.local.arpeggio.one
  • the router box gets NXDOMAIN while hotspot clients work

Likely causes

  • Caddy internal CA is not yet trusted on the client
  • the host machine is using its own resolver instead of the hotspot DNS responder

Checks

curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/
dig @10.x.x.x iona.local.arpeggio.one
dig @10.x.x.x iona-api.local.arpeggio.one

Recovery

  • trust Caddy's internal CA on the client, or use HTTP temporarily
  • if the host itself cannot resolve the names, add a host-only fallback:
echo '10.x.x.x iona.local.arpeggio.one iona-api.local.arpeggio.one' | sudo tee -a /etc/hosts

Open WebUI upgrade breaks sessions or stored secrets

Symptom

  • users are logged out after recreation
  • saved tokens or secrets no longer decrypt correctly

Likely causes

  • WEBUI_SECRET_KEY changed during container recreation
  • the data volume was recreated without a matching key

Checks

Capture the existing key before deleting the container:

SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"
[ -n "$SECRET_KEY" ] || { echo "WEBUI_SECRET_KEY not found"; exit 1; }

Inspect the actual mount in use:

docker inspect open-webui --format '{{range .Mounts}}{{println .Type .Name .Source "->" .Destination}}{{end}}'

Recovery

  • preserve the original WEBUI_SECRET_KEY when recreating the container
  • back up the data volume before upgrade
  • if the container is already gone, restore the volume first and recover the key from container env or shell history if possible

Hotspot comes back but local names do not work for clients

Symptom

  • clients join the hotspot, but iona.local.arpeggio.one does not resolve

Likely causes

  • the dnsmasq shared config was not written or not reloaded
  • NetworkManager was restarted but the hotspot profile was not brought back up

Checks

sudo grep -R "iona.local.arpeggio.one" /etc/NetworkManager/dnsmasq-shared.d
dig @10.x.x.x iona.local.arpeggio.one

Recovery

sudo systemctl restart NetworkManager
sudo nmcli connection up "Hotspot"
sudo nmcli connection up "Hotspot-2"

If the config is missing, recreate it:

sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF

Trivia: Why “Iona” and “Arpeggio” fit

The names are a small nod to Arpeggio of Blue Steel, where Iona is one of the central mental models and the title itself carries the maritime frame.

That makes the pairing unusually apt here.

  • Arpeggio already fits the broader project identity: structured progression, navigation, cadence, and movement across a chart rather than a static point.
  • Iona fits the local stack: compact, capable, calm under normal operation, and most useful when treated as part of a vessel’s working system rather than as magic.

There is also a mild joke in it. In the anime, Iona is a warship avatar. Here, “Iona” ended up as a local model stack running on aging GPUs, reverse proxies, DNS glue, and enough shell commands to annoy a normal person. The name overshoots the hardware a little, which is part of the charm.