general 2026-04-23 · Updated 2026-04-23

Running a 2-GPU Local LLM (Iona Stack)

iona

What This Actually Is

This page documents a working single-machine local LLM stack built around llama.cpp, Open WebUI, and optional Caddy. The point is to capture the exact shape that worked, the commands that brought it up, and the failure modes that consumed time.

The setup exposes a stable OpenAI-compatible endpoint backed by two GPUs.

Components:

  • llama.cpp server (inference)
  • Open WebUI (frontend)
  • optional Caddy (routing / TLS)

Shape:

  • API β†’ :8080
  • UI β†’ :3000

Optional hostnames:

  • iona.local.arpeggio.one
  • iona-api.local.arpeggio.one

The goal is not scale. The goal is control.


Model Choice (Where Reality Bites)

Working baseline:

  • Gemma 4 26B A4B
  • Q4 quantization
  • multimodal projector (mmproj)
  • ~128K context target

This is not arbitrary.

31B-class models were tested, including runs with mmproj, but on dual 2080 Ti the usable headroom narrowed too much once long context and KV cache were included in the budget. The problem was not benchmark performance in the abstract. The problem was operational headroom during real sessions.

26B remained the largest configuration that stayed comfortable enough to use repeatedly without turning every longer run into a fit problem.

The limiting factor was not just raw VRAM on paper. It was the combined effect of KV cache growth, context length, and the safety margin needed to keep the system usable instead of merely bootable.


Dual GPU Split

Required flag:

--split-mode layer

What this does:

  • divides layers across both GPUs
  • shares VRAM pressure
  • introduces PCIe as the limiting factor

Distributing layers across both cards buys enough room for the model, but it does not erase PCIe traffic or synchronization costs.

Consequences:

  • cross-device latency exists
  • batching makes those costs more visible
  • higher concurrency increases coordination overhead between cards

The setup stays tractable because the moving parts are few: one inference container, one UI container, optional reverse proxy, fixed paths for model files, and a small set of known-good runtime flags.
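
A quick way to confirm the split actually lands evenly is to watch per-GPU memory and utilization while the model loads and during a long generation (standard nvidia-smi query fields, nothing stack-specific):

watch -n 2 'nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv'

Both cards should settle into a similar memory band once loading finishes; a large gap usually means the layer distribution is lopsided.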


Concurrency Trap

Setting:

--parallel 1

This is the one most people try to "improve".

Increasing it causes:

  • KV cache fragmentation
  • reduced usable context
  • unstable latency

Keeping it at 1:

  • preserves context
  • keeps memory predictable
  • avoids fragmentation-induced failures

Throughput tuning here trades away the only thing that matters: stability.
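
The context cost is mechanical rather than incidental: in llama.cpp's server, the --ctx-size budget is shared across the parallel slots, so every extra slot directly shrinks the per-request window.

  --parallel 1  →  131072 tokens available per request
  --parallel 2  →   65536 tokens per request (131072 / 2)
  --parallel 4  →   32768 tokens per request (131072 / 4)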


Flags That Actually Matter

Representative configuration:

--ctx-size 131072 --cache-type-k q4_0 --cache-type-v q4_0 --batch-size 1024 --ubatch-size 256 --fit on --fit-target 2048,1024

Interpretation:

  • ctx-size β†’ maximum working memory window
  • KV quantization β†’ makes large context possible
  • batch / ubatch β†’ throughput vs stability balance
  • fit / fit-target β†’ prevents OOM under pressure

If KV cache is not controlled, everything else is irrelevant.
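
A back-of-envelope check makes the KV point concrete. The architecture numbers below are illustrative assumptions, not the exact dimensions of this model, but the shape of the math is what matters:

  per-token KV values ≈ 2 (K+V) × layers × kv_heads × head_dim
                      = 2 × 48 × 8 × 128              ≈ 98K values per token
  f16 KV  at 131072 ctx ≈ 98K × 131072 × 2.0 bytes    ≈ 24 GiB
  q4_0 KV at 131072 ctx ≈ 98K × 131072 × 0.56 bytes   ≈ 6.8 GiB

Against two 11 GiB cards that also have to hold the Q4 weights, the f16 figure never fits; the quantized figure is what makes the 128K target plausible at all.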


API Layer (The Real Lever)

Endpoint:

http://127.x.x.x:8080/v1

Implications:

  • any OpenAI-compatible client works
  • existing tools integrate without modification
  • local inference becomes infrastructure, not a toy

The model file is replaceable. The more durable value is the harness around it: the OpenAI-compatible /v1 surface, stable model aliases, predictable routing, and a UI that can be swapped or upgraded without redesigning the rest of the stack.

In SOLID terms, this is close to the value of a stable substitution boundary. If a component honors the same contract, the rest of the system can keep working while you change the implementation behind it.
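
Because the surface is plain OpenAI-compatible HTTP, a raw request needs nothing beyond curl. The alias and key path below are the ones configured in the compose file later in this page:

curl -s http://127.x.x.x:8080/v1/chat/completions \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-26b-a4b",
        "messages": [{"role": "user", "content": "Explain what --split-mode layer does in one paragraph."}]
      }'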


What Breaks First

Observed failure modes:

  • KV cache exhaustion under long sessions
  • context degradation masked as "hallucination"
  • GPU imbalance under uneven layer distribution
  • latency spikes from cross-GPU sync
  • WebUI mismatch (wrong endpoint or port mapping)

Most stability problems in this stack come from memory pressure rather than from the base model itself. The symptoms vary, but the underlying causes usually reduce to KV cache fit, context size, batching, split balance, or a model choice that leaves too little headroom for normal use.


Operational Reality

This stack does not behave like cloud inference.

It behaves like a constrained system:

  • predictable when respected
  • unstable when pushed past limits

You are trading elasticity for control.

That trade only works if you accept the constraints as first-class.


Why This Setup Exists At All

The practical win is straightforward.

On hardware that is now nearly a decade old, this stack can still cover a large share of the work many people actually do day to day: coding, shell work, log reading, config review, operational debugging, and structured writing grounded in supplied context. For that class of work, the result is not theoretical. It is a usable local system with good throughput, low marginal cost, and full control over prompts, files, and iteration loops.

The stack is useful because it combines:

  • good enough model quality for real technical work
  • local control over models and data
  • a stable OpenAI-compatible harness that other tools can target
  • no upload requirement for local files and images
  • fast local iteration once the model is warm

Practical comparison: local Gemma 4 26B A4B vs GPT-4o

This is an operator estimate from day-to-day use on this stack, not a benchmark claim.

Dimension | Local Gemma 4 26B A4B | GPT-4o | Notes
Coding and shell help | 80–90% | 100% | Good for implementation, refactors, shell commands, and code reading when the task is bounded and the context is prepared well.
Operational reasoning | 80–90% | 100% | Good for logs, service wiring, config review, and debugging workflows. The gap shows up more on broad, ambiguous, cross-domain problems.
Long-form structured writing | 75–85% | 100% | Solid when the structure and source material are provided. Less consistent on nuance, tone control, and longer argumentative flow.
Vision / multimodal | 60–75% | 100% | mmproj makes image support practical, but the ceiling is lower on harder image interpretation and mixed-modality work.
Broad world knowledge without supplied context | 65–80% | 100% | Local Gemma performs best when grounded in provided material. The gap widens when the task depends more on broad priors.
Response speed once warm | 100–130% | 100% | Around 90 tok/s locally is often faster than the perceived response rate of hosted chat systems for text generation.
Privacy / control | 130–150% | 100% | Local execution avoids sending prompts, files, and images off the box.
Marginal usage cost after setup | 140–160% | 100% | Once the hardware and stack are in place, repeated use is cheap enough to encourage experimentation without token budgeting.

A practical summary is that local Gemma 4 26B lands in roughly the 80 to 90 percent band for many coding and operator-style tasks that fit the stack well, while also outperforming hosted usage on some dimensions that matter operationally: response speed after warmup, privacy, image locality, and marginal cost.

Boundaries

This stack is best treated as:

  • a personal or small-team local inference layer
  • a reusable OpenAI-compatible endpoint
  • a controlled environment for technical and operational work

It is not optimized here for:

  • multi-tenant serving
  • horizontal scaling
  • large shared hosted workloads

Those are different system goals and usually require different tradeoffs.

Bottom Line

This is a practical local inference layer built on old but still capable hardware.

It works well when:

  1. model size matches the available headroom
  2. concurrency stays modest
  3. memory is budgeted explicitly
  4. routing stays deterministic through fixed ports and clean hostnames

When those conditions are respected, the result is not a toy. It is a private OpenAI-compatible endpoint with long-context experimentation, multimodal support, and a harness you can keep improving as local models continue to get better.


Model Acquisition and Pointers

Download model + projector

python3 - <<'PY'
from huggingface_hub import snapshot_download

# Pull only the Q4_0 weights and the f16 multimodal projector.
snapshot_download(
    repo_id="bartowski/google_gemma-4-26B-A4B-it-GGUF",
    allow_patterns=[
        "*Q4_0.gguf",
        "mmproj-google_gemma-4-26B-A4B-it-f16.gguf",
    ],
    local_dir="models/gemma4-26b",
    local_dir_use_symlinks=False,  # write real files into local_dir, not cache symlinks
)
PY

Stable pointers (container reads fixed paths)

cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf

Rules:

  • keep symlinks relative
  • targets must exist under the bind-mounted /models tree

Multimodal (Image) Support

Both files are required at runtime:

-m /models/model.gguf
--mmproj /models/mmproj.gguf

Sanity checks:

ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf
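
ls -lah on a symlink only shows the link itself; adding -L follows it, so a broken pointer fails loudly here instead of at container start:

ls -lahL ~/local-ai/gemma4/models/model.gguf ~/local-ai/gemma4/models/mmproj.gguf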

Compose (API Server)

services:
  gemma:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: gemma-local
    restart: unless-stopped
    gpus: all
    ports:
      - "127.x.x.x:8080:8080"
    env_file:
      - .env
    volumes:
      - ./models:/models:ro
    command: >
      -m /models/model.gguf
      --mmproj /models/mmproj.gguf
      --alias gemma-4-26b-a4b
      --host 0.0.0.0
      --port 8080
      --api-key ${LLM_API_KEY}
      --ctx-size 131072
      --parallel 1
      --cache-type-k q4_0
      --cache-type-v q4_0
      --split-mode layer
      --fit on
      --fit-target 2048,1024
      --batch-size 1024
      --ubatch-size 256
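
The compose file reads LLM_API_KEY from .env, while the health checks below read ~/local-ai/gemma4/.api_key. One way to keep the two in sync, a sketch using the same paths already used on this page:

cd ~/local-ai/gemma4 || exit 1
umask 077
openssl rand -hex 32 > .api_key
printf 'LLM_API_KEY=%s\n' "$(cat .api_key)" > .env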

Bring up + logs:

cd ~/local-ai/gemma4 || exit 1
docker compose up -d
docker compose logs -f gemma

Health checks:

curl -s http://127.x.x.x:8080/health && echo
curl -s http://127.x.x.x:8080/v1/models \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo

Open WebUI (separate container)

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 127.x.x.x:3000:8080 \
  -v open-webui_open-webui:/app/backend/data \
  -e WEBUI_SECRET_KEY="$(cat ~/.config/iona/open-webui.secret)" \
  --add-host=host.docker.internal:host-gateway \
  --add-host=iona-api.local.arpeggio.one:host-gateway \
  --add-host=iona.local.arpeggio.one:host-gateway \
  ghcr.io/open-webui/open-webui:v0.9.1
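
The run above reads its secret from ~/.config/iona/open-webui.secret, so that file has to exist before the first launch. One way to create it:

mkdir -p ~/.config/iona
umask 077
openssl rand -hex 32 > ~/.config/iona/open-webui.secret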

Confirm runtime:

docker ps --format 'table {{.Names}}	{{.Image}}	{{.Ports}}	{{.Status}}'

Provider URL inside WebUI:

http://iona-api.local.arpeggio.one/v1

If models do not appear, add:

  • gemma-4-26b-a4b under Model IDs filter

WebUI Upgrade and State Safety

Back up data volume:

mkdir -p ~/backups

docker run --rm \
  -v open-webui_open-webui:/source:ro \
  -v ~/backups:/backup \
  alpine \
  sh -c 'cd /source && tar czf /backup/open-webui-$(date +%Y%m%d-%H%M%S).tar.gz .'

Capture existing key:

SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"

Recreate:

docker pull ghcr.io/open-webui/open-webui:v0.9.1
docker rm -f open-webui
# re-run with same flags + preserved SECRET_KEY
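
Spelled out, the re-run is the original launch command with the captured key substituted in:

docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 127.x.x.x:3000:8080 \
  -v open-webui_open-webui:/app/backend/data \
  -e WEBUI_SECRET_KEY="$SECRET_KEY" \
  --add-host=host.docker.internal:host-gateway \
  --add-host=iona-api.local.arpeggio.one:host-gateway \
  --add-host=iona.local.arpeggio.one:host-gateway \
  ghcr.io/open-webui/open-webui:v0.9.1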

If container is gone:

  • restore volume from backup (command below)
  • recover key from shell history
  • last resort: new key (sessions break)
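
Restoring the volume mirrors the backup command; the archive name below is a placeholder for whichever timestamped file you want to roll back to:

docker run --rm \
  -v open-webui_open-webui:/restore \
  -v ~/backups:/backup:ro \
  alpine \
  sh -c 'cd /restore && tar xzf /backup/open-webui-YYYYMMDD-HHMMSS.tar.gz'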

Caddy (full shape)

http://iona.local.arpeggio.one {
    reverse_proxy 127.x.x.x:3000
}

https://iona.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:3000
}

http://iona-api.local.arpeggio.one {
    reverse_proxy 127.x.x.x:8080
}

https://iona-api.local.arpeggio.one {
    tls internal
    reverse_proxy 127.x.x.x:8080
}

Validate + reload:

sudo caddy fmt --overwrite /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

Socket checks:

sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'

Resolve checks:

curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/

Hotspot + DNS (optional but part of system)

Start hotspot:

sudo nmcli dev wifi hotspot \
  ifname wl**********c6 \
  ssid MoonshotBridge \
  password "YOUR_PASSWORD"

Inject DNS:

sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF

Verify:

dig @10.x.x.x iona.local.arpeggio.one

systemd Wrapper (optional)

[Unit]
Description=Iona local AI containers
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker start open-webui gemma-local
ExecStop=/usr/bin/docker stop open-webui gemma-local

[Install]
WantedBy=multi-user.target
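
Assuming the unit above is saved under /etc/systemd/system (the file name iona.service here is arbitrary), enabling it is the usual pair of commands:

sudo systemctl daemon-reload
sudo systemctl enable --now iona.service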

Stop / Restart

docker stop open-webui gemma-local
docker start open-webui gemma-local
cd ~/local-ai/gemma4
docker compose stop gemma
docker compose up -d
sudo systemctl restart caddy

Operational Checklist

docker ps --format 'table {{.Names}}	{{.Status}}	{{.Ports}}'
sudo ss -ltnp | grep -E ':3000|:8080|:80|:443'
getent hosts iona.local.arpeggio.one iona-api.local.arpeggio.one
curl -s http://127.x.x.x:8080/health

Practical Boundaries

Chosen:

  • 26B daily driver
  • local-only routing
  • manual memory fallback

Deferred:

  • 31B daily driver
  • public exposure
  • autonomous memory reliance

TSG (Troubleshooting Guide)

Models do not appear in Open WebUI

Symptom

  • the provider connection looks valid, but the model list is empty

Likely causes

  • provider URL is wrong inside the container
  • /v1 was omitted
  • saved provider state in WebUI is stale
  • the model ID needs to be surfaced manually

Checks

curl -s http://127.x.x.x:8080/v1/models \
  -H "Authorization: Bearer $(cat ~/local-ai/gemma4/.api_key)" && echo

Inside WebUI, verify the provider URL is:

http://iona-api.local.arpeggio.one/v1

Recovery

  • fix the provider URL
  • recreate the provider entry if the saved state looks stale
  • add gemma-4-26b-a4b under Model IDs filter if discovery still fails

Open WebUI shows a network error when calling the model

Symptom

  • UI loads, but requests to the local provider fail with a network error

Likely causes

  • the container cannot reach the host-side API
  • localhost or 127.x.x.x was used inside the WebUI container
  • host-gateway mappings were not added at container start

Checks

Inspect the container launch flags and confirm these host mappings exist:

--add-host=host.docker.internal:host-gateway
--add-host=iona-api.local.arpeggio.one:host-gateway
--add-host=iona.local.arpeggio.one:host-gateway
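
One way to confirm the mappings on the running container without re-reading shell history; HostConfig.ExtraHosts holds the --add-host entries:

docker inspect open-webui --format '{{.HostConfig.ExtraHosts}}'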

Recovery

  • use http://iona-api.local.arpeggio.one/v1 in WebUI, not host loopback
  • recreate the WebUI container with the host-gateway mappings if they were omitted

KV allocation or buffer allocation fails

Symptom

  • failed to allocate buffer for kv cache
  • model starts inconsistently or fails during larger runs

Likely causes

  • context is set too high for the model + cache combination
  • the chosen model leaves too little VRAM headroom
  • batching is too aggressive for the available fit

Checks

Review the runtime flags in use:

  • --ctx-size
  • --cache-type-k
  • --cache-type-v
  • --batch-size
  • --ubatch-size
  • --fit-target

Recovery

  • step down to the 26B configuration if you drifted upward
  • reduce context before changing many things at once
  • keep KV cache quantized
  • avoid simultaneous changes to model size, context, and batching because that makes failures harder to localize

GGUF file cannot be opened

Symptom

  • failed to open GGUF file '/models/model.gguf'

Likely causes

  • the symlink points outside the bind-mounted /models tree
  • the symlink is absolute when it should be relative
  • the target file was renamed or moved

Checks

ls -lah ~/local-ai/gemma4/models/model.gguf
ls -lah ~/local-ai/gemma4/models/mmproj.gguf

Recovery

Recreate the stable pointers as relative symlinks:

cd ~/local-ai/gemma4/models || exit 1
rm -f model.gguf mmproj.gguf
ln -s ./gemma4-26b/google_gemma-4-26B-A4B-it-Q4_0.gguf model.gguf
ln -s ./gemma4-26b/mmproj-google_gemma-4-26B-A4B-it-f16.gguf mmproj.gguf

Browser warns on HTTPS or local names fail to resolve

Symptom

  • browser warns on https://iona.local.arpeggio.one
  • the router box gets NXDOMAIN while hotspot clients work

Likely causes

  • Caddy internal CA is not yet trusted on the client
  • the host machine is using its own resolver instead of the hotspot DNS responder

Checks

curl --resolve iona.local.arpeggio.one:80:127.x.x.x http://iona.local.arpeggio.one/
curl -k --resolve iona.local.arpeggio.one:443:127.x.x.x https://iona.local.arpeggio.one/
dig @10.x.x.x iona.local.arpeggio.one
dig @10.x.x.x iona-api.local.arpeggio.one

Recovery

  • trust Caddy's internal CA on the client, or use HTTP temporarily
  • if the host itself cannot resolve the names, add a host-only fallback:
echo '10.x.x.x iona.local.arpeggio.one iona-api.local.arpeggio.one' | sudo tee -a /etc/hosts

Open WebUI upgrade breaks sessions or stored secrets

Symptom

  • users are logged out after recreation
  • saved tokens or secrets no longer decrypt correctly

Likely causes

  • WEBUI_SECRET_KEY changed during container recreation
  • the data volume was recreated without a matching key

Checks

Capture the existing key before deleting the container:

SECRET_KEY="$(docker inspect open-webui --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n 's/^WEBUI_SECRET_KEY=//p' | head -n 1)"
[ -n "$SECRET_KEY" ] || { echo "WEBUI_SECRET_KEY not found"; exit 1; }

Inspect the actual mount in use:

docker inspect open-webui --format '{{range .Mounts}}{{println .Type .Name .Source "->" .Destination}}{{end}}'

Recovery

  • preserve the original WEBUI_SECRET_KEY when recreating the container
  • back up the data volume before upgrade
  • if the container is already gone, restore the volume first and recover the key from container env or shell history if possible

Hotspot comes back but local names do not work for clients

Symptom

  • clients join the hotspot, but iona.local.arpeggio.one does not resolve

Likely causes

  • the dnsmasq shared config was not written or not reloaded
  • NetworkManager was restarted but the hotspot profile was not brought back up

Checks

sudo grep -R "iona.local.arpeggio.one" /etc/NetworkManager/dnsmasq-shared.d
dig @10.x.x.x iona.local.arpeggio.one

Recovery

sudo systemctl restart NetworkManager
sudo nmcli connection up "Hotspot"
sudo nmcli connection up "Hotspot-2"

If the config is missing, recreate it:

sudo mkdir -p /etc/NetworkManager/dnsmasq-shared.d
sudo tee /etc/NetworkManager/dnsmasq-shared.d/iona.conf >/dev/null <<'EOF'
address=/iona.local.arpeggio.one/10.x.x.x
address=/iona-api.local.arpeggio.one/10.x.x.x
EOF

Trivia: Why “Iona” and “Arpeggio” fit

The names are a small nod to Arpeggio of Blue Steel, where Iona is one of the central mental models and the title itself carries the maritime frame.

That makes the pairing unusually apt here.

  • Arpeggio already fits the broader project identity: structured progression, navigation, cadence, and movement across a chart rather than a static point.
  • Iona fits the local stack: compact, capable, calm under normal operation, and most useful when treated as part of a vessel’s working system rather than as magic.

There is also a mild joke in it. In the anime, Iona is a warship avatar. Here, “Iona” ended up as a local model stack running on aging GPUs, reverse proxies, DNS glue, and enough shell commands to annoy a normal person. The name overshoots the hardware a little, which is part of the charm.