general 2026-06-02 · Updated 2026-06-02

Building a Local AI That Can Search the Web and Remember Across Threads

hero I have been experimenting with a local AI setup that feels less like a toy chatbot and more like a working systems aide: a local model running through llama.cpp, Open WebUI as the interface, Crawl4AI for web extraction, and Open WebUI’s own memory and chat-history tools for implicit recall across conversations.

The result is surprisingly usable. The model can answer normal chat questions, search the web when needed, crawl and summarize pages, and retrieve useful context from previous Open WebUI threads.

It is not magic.

It is plumbing.

But good plumbing is underrated.

What This Stack Does

The goal was to run a local AI that could do four things:

Run a capable local model.
Use Open WebUI as the chat interface.
Search and crawl the web from inside a conversation.
Recall useful context from previous Open WebUI threads.

The core stack looks like this:

Open WebUI
  → llama.cpp model server
  → Open WebUI native web search
  → Crawl4AI for web page extraction
  → Open WebUI Memory / Chat History / Knowledge Base tools

The important distinction is that web search, web crawling, and cross-thread memory are separate capabilities.

Native Web Search = finds candidate web results
Crawl4AI = crawls and extracts readable content from URLs
Chat History = searches previous Open WebUI conversations
Memory = stores and retrieves durable user/project context
Knowledge Base = RAG over uploaded files and knowledge collections

Crawl4AI does not make the AI remember previous chats. It handles external web pages.

The cross-thread recall comes from Open WebUI’s built-in tools, especially Chat History and Memory, when function calling is enabled.

Architecture: Services and Data Flow

There are three main runtime services:

gemma       = local llama.cpp model server
open-webui  = chat interface and tool orchestrator
crawl4ai    = local web crawler / extractor

The runtime data flow looks like this:

User
  → Open WebUI
    → local model server via OpenAI-compatible API
    → built-in Open WebUI tools
    → external Web Search and Crawl tool
      → Open WebUI native web search
      → Crawl4AI
        → target web pages

There are two major retrieval paths.

The first is external web retrieval:

User asks web/current-info question
  → model calls Web Search and Crawl tool
  → Open WebUI native search finds URLs
  → Crawl4AI crawls URLs
  → extracted content returns to model
  → model answers with fresh context

The second is local continuity retrieval:

User asks about prior work or remembered context
  → model calls Open WebUI built-in tools
  → Chat History / Memory / Notes / Knowledge Base retrieve local context
  → model answers using retrieved prior material

That second path is what makes the assistant feel like it can remember across threads.

It is not the model magically having infinite context. It is retrieval over Open WebUI’s stored data.

That distinction matters.

The Model Server

The local model is served through llama.cpp using the CUDA server image.

In this setup, the model server exposes an OpenAI-compatible API on port 8080.

A local HTTPS proxy such as Caddy can sit in front of the model server, allowing Open WebUI to talk to the local model through a familiar OpenAI-compatible endpoint:

https://your-local-model-domain.example/v1

A quick health check looks like this:

curl -k https://your-local-model-domain.example/v1/models

If that returns a model list, the model server and HTTPS route are alive.

If you do not use Caddy or HTTPS locally, you can point Open WebUI directly at the model server instead:

http://host.docker.internal:8080/v1

or, depending on your Docker network:

http://gemma:8080/v1

The exact endpoint depends on whether Open WebUI reaches the model through the host, a reverse proxy, or the internal Docker network.

Why Docker Compose Matters

Originally, I started containers manually with docker run.

That worked until Open WebUI needed to call Crawl4AI by container name:

http://crawl4ai:11235

The problem was Docker networking.

A container name like crawl4ai is only resolvable from another container if both containers share the same Docker network.

When containers are started one by one, they may not land on the same user-defined bridge network. So Open WebUI may not be able to resolve crawl4ai, even if Crawl4AI is running perfectly from the host.

Docker Compose fixes this by declaring the stack as one unit. Compose creates a shared network and attaches all services to it. Then service names become internal DNS names.

So this works:

open-webui → http://crawl4ai:11235

because both services are on the same Compose network.

The lesson is simple: if containers need to talk to each other by name, put them in the same Compose file or attach them to the same user-defined Docker network.

Example Docker Compose File

Here is the basic Compose shape for this setup.

This example runs:

gemma       = local llama.cpp model server
open-webui  = Open WebUI interface
crawl4ai    = local web crawler / extractor

It deliberately does not include SearXNG.

The search path here assumes Open WebUI native web search is enabled, and Crawl4AI is only used to crawl and extract the URLs returned by native search.

services:
  gemma:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: gemma-local
    restart: unless-stopped
    gpus: all
    ports:
      - "127.0.0.1:8080:8080"
    env_file:
      - .env
    volumes:
      - ./models:/models:ro
    command: >
      -m /models/gemma4-31b/google_gemma-4-31B-it-Q4_K_S.gguf
      --mmproj /models/gemma4-31b/mmproj-google_gemma-4-31B-it-bf16.gguf
      --host 0.0.0.0
      --port 8080
      --api-key ${LLM_API_KEY}
      --ctx-size 28000
      --parallel 1
      --cache-type-k q4_0
      --cache-type-v q4_0
      --fit on
      --fit-target 2048,256
      --batch-size 512
      --ubatch-size 312
    networks:
      - local-ai

  open-webui:
    image: ghcr.io/open-webui/open-webui:v0.9.1
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:8080"
    volumes:
      - open-webui_open-webui:/app/backend/data
    depends_on:
      - gemma
      - crawl4ai
    networks:
      - local-ai

  crawl4ai:
    image: unclecode/crawl4ai:latest
    container_name: crawl4ai
    restart: unless-stopped
    shm_size: "1g"
    ports:
      - "127.0.0.1:11235:11235"
    networks:
      - local-ai

volumes:
  open-webui_open-webui:
    external: true

networks:
  local-ai:
    name: local-ai

The important detail is the shared Docker network:

networks:
  local-ai:
    name: local-ai

Each service joins that network:

networks:
  - local-ai

That is what allows Open WebUI to call Crawl4AI by service name:

http://crawl4ai:11235

Without the shared network, crawl4ai may be running perfectly, but Open WebUI will not be able to resolve the hostname from inside its own container.

Docker networking, as usual, works beautifully once you already know the answer.

Compose Flag and Setting Breakdown

The Compose file is small, but several lines matter.

image: ghcr.io/ggml-org/llama.cpp:server-cuda

This uses the CUDA-enabled llama.cpp server image. It lets the local model run with GPU acceleration.

container_name: gemma-local

This gives the model server a stable container name. That makes logs and debugging easier:

docker logs -f gemma-local

restart: unless-stopped

This tells Docker to restart the container after crashes or reboots, unless you explicitly stop it.

gpus: all

This gives the container access to available NVIDIA GPUs. For a single-GPU setup, this is usually enough. For multi-GPU tuning, advanced users can add explicit split settings later, but this example keeps the default behavior to avoid baking in hardware-specific assumptions.

ports:
  - "127.0.0.1:8080:8080"

This exposes the model server only on localhost, not on the public network.

The format is:

host_ip:host_port:container_port

So this means:

127.0.0.1:8080 on the host
  → port 8080 inside the container

Binding to 127.0.0.1 is safer than binding to 0.0.0.0, because it avoids exposing the service to the LAN or public network by accident.

env_file:
  - .env

This loads environment variables from a local .env file.

In this setup, that file contains the local model server API key:

LLM_API_KEY=replace-this-with-your-local-key

Do not publish real API keys.

volumes:
  - ./models:/models:ro

This mounts the local ./models directory into the container at /models.

The :ro suffix means read-only. The model server can read model files but cannot modify them.

command: >

This passes startup arguments to the llama.cpp server.

Important model-server flags:

-m /models/...gguf

Path to the main GGUF model file.

--mmproj /models/...gguf

Path to the multimodal projection file, if using a multimodal model.

--host 0.0.0.0

Makes the server listen on all interfaces inside the container.

This is necessary because the container port must be reachable from outside the container. The host-level port binding still restricts exposure to 127.0.0.1.

--port 8080

Runs the model API on port 8080.

--api-key ${LLM_API_KEY}

Protects the model endpoint with an API key from the .env file.

--ctx-size 28000

Sets the context window used by the server.

Larger context allows longer prompts and retrieved material, but uses more VRAM/RAM.

--parallel 1

Controls how many parallel request slots the server handles.

For a large local model on limited VRAM, 1 is often safer.

--cache-type-k q4_0
--cache-type-v q4_0

Quantizes the KV cache to reduce memory usage.

This helps fit larger context windows on available hardware, at some quality/performance tradeoff.

--fit on
--fit-target 2048,256

Model/server-specific fitting behavior. In practice, this helps the server manage prompt/image or multimodal fit behavior depending on the build and model support.

--batch-size 512
--ubatch-size 312

Controls batching behavior.

These affect throughput and memory use. Larger values can improve speed but may consume more VRAM.

This example intentionally does not include:

--split-mode layer
--tensor-split ...

Those flags are useful for multi-GPU tuning, but they are hardware-specific. Leaving them out keeps the example portable and avoids implying a particular GPU arrangement.

For Open WebUI:

volumes:
  - open-webui_open-webui:/app/backend/data

This preserves Open WebUI settings, model profiles, memory, chat history, knowledge, and uploaded data.

For Crawl4AI:

shm_size: "1g"

This gives Crawl4AI more shared memory, which is useful because browser-based crawling can otherwise run into shared-memory limits.

ports:
  - "127.0.0.1:11235:11235"

This exposes Crawl4AI on localhost for host testing, while still allowing Open WebUI to call it internally by service name:

http://crawl4ai:11235

Preserving Open WebUI Settings

The Open WebUI data volume is critical:

volumes:
  - open-webui_open-webui:/app/backend/data

In my setup, this reused an existing named volume.

That preserved Open WebUI settings, model configuration, memory, chat history, and stored conversations.

If you are setting this up from scratch, you can use a normal named volume instead:

volumes:
  open-webui:

and mount it like this:

volumes:
  - open-webui:/app/backend/data

But if you already have an Open WebUI install with existing settings, inspect the current container before replacing it:

docker inspect open-webui --format '{{json .Mounts}}' | jq

Look for the volume mounted at:

/app/backend/data

That is the volume you want to preserve.

The rule is simple:

Containers are disposable.
Volumes are where the memory lives.

If you accidentally mount a fresh volume, Open WebUI may still run, but it will not have the same settings, chats, memories, or model configuration.

Starting the Stack

From the local AI project directory:

docker stop open-webui gemma-local crawl4ai 2>/dev/null || true
docker rm open-webui gemma-local crawl4ai 2>/dev/null || true

docker compose -f docker-compose.31b.native-search-crawl4ai.yml up -d

The first command stops any old manually-created containers.

docker stop open-webui gemma-local crawl4ai 2>/dev/null || true

The 2>/dev/null part hides error messages if a container does not exist.

The || true part prevents the command from failing the whole script.

The second command removes the stopped containers:

docker rm open-webui gemma-local crawl4ai 2>/dev/null || true

This is needed because Docker Compose cannot create a container with a name already used by a manually-created container.

The final command starts the Compose stack:

docker compose -f docker-compose.31b.native-search-crawl4ai.yml up -d

Flag breakdown:

-f docker-compose.31b.native-search-crawl4ai.yml

Use this specific Compose file.

up

Create and start the services.

-d

Detached mode. Run in the background.

To watch the logs:

docker compose -f docker-compose.31b.native-search-crawl4ai.yml logs --tail=100 -f

Flag breakdown:

logs

Show service logs.

--tail=100

Show only the last 100 log lines initially.

-f

Follow logs live.

To watch only the model server logs:

docker logs --tail=100 -f gemma-local

To watch resource usage:

docker stats open-webui gemma-local crawl4ai

To watch GPU activity:

watch -n 1 nvidia-smi

Verifying the Model

Check that the model endpoint works from the host:

curl -k https://your-local-model-domain.example/v1/models

If you are not using HTTPS or a reverse proxy, check the local port instead:

curl http://127.0.0.1:8080/v1/models

Then check that Open WebUI can reach the model from inside its container:

docker exec -it open-webui sh -lc \
  'curl -k https://your-local-model-domain.example/v1/models'

Flag breakdown:

docker exec

Run a command inside an existing container.

-it

Interactive terminal mode.

open-webui

The container to run the command inside.

sh -lc

Run the command through a shell.

If the host can reach the model but Open WebUI cannot, the issue is container DNS, routing, certificate trust, or host gateway access.

If both can reach it, the model path is healthy.

Installing Crawl4AI

Crawl4AI runs as a local service. It exposes an HTTP API on port 11235.

The standalone command is:

docker run -d \
  --name crawl4ai \
  --restart unless-stopped \
  --shm-size=1g \
  -p 11235:11235 \
  unclecode/crawl4ai:latest

Flag breakdown:

-d

Run detached in the background.

--name crawl4ai

Give the container a stable name.

--restart unless-stopped

Restart automatically after reboot or crash unless manually stopped.

--shm-size=1g

Give the container 1 GB of shared memory.

-p 11235:11235

Expose container port 11235 on host port 11235.

unclecode/crawl4ai:latest

The Crawl4AI Docker image.

For a real stack, Compose is cleaner because Open WebUI needs to resolve Crawl4AI by service name.

After the stack is up, test Crawl4AI from inside Open WebUI:

docker exec -it open-webui sh -lc \
  'curl -sS http://crawl4ai:11235 | head'

If that works, Open WebUI can reach Crawl4AI over the internal Docker network.

Web Search Versus Web Crawling

This part is easy to confuse.

Open WebUI native search finds web results.

Crawl4AI fetches and extracts page content from URLs.

The intended flow is:

Open WebUI native web search
  → returns URLs
  → Web Search and Crawl tool sends URLs to Crawl4AI
  → Crawl4AI extracts content
  → model reads the extracted content

Crawl4AI is not a search engine. It is a crawler and extractor.

I briefly considered adding SearXNG, which is a self-hosted metasearch engine. But that is an optional extra search layer, not required for this setup.

For now, the simpler path is:

Open WebUI native search → Crawl4AI

That avoids running an extra metasearch service unless native search proves unreliable.

Setting Up the Open WebUI Side

The Docker stack gets the services running, but Open WebUI still needs to know which tools the model is allowed to use.

This part is easy to miss because the infrastructure can be healthy while the model still behaves like it has no tools. The containers may all be running, Crawl4AI may be reachable, and the model may be visible, but if the model profile is not configured correctly, nothing agentic happens.

There are three separate Open WebUI pieces to configure:

1. Import the Web Search and Crawl tool.
2. Attach that tool to the model profile.
3. Enable native function calling and built-in tools such as Memory and Chat History.

These are related, but not the same.

Importing the Web Search and Crawl Tool

The external web crawling capability comes from an Open WebUI tool plugin called Web Search and Crawl.

The tool I used is titled:

Web Search and Crawl

Its description says it can search and crawl the web using SearXNG, Open WebUI native search, and Crawl4AI. It extracts content from URLs using a self-hosted Crawl4AI instance, with optional deeper research behavior.

In Open WebUI, the rough flow is:

Workspace
  → Tools
  → Add / Import tool
  → paste or import the Web Search and Crawl script
  → Save

Depending on your Open WebUI version, the tool may come from the community tool store / tool marketplace flow rather than being pasted manually. The community flow usually looks like:

Find the tool
  → View / Get
  → Import to WebUI
  → Save under Workspace → Tools

The important part is that the tool must end up saved in Open WebUI’s Workspace → Tools area before a model can use it.

Security note: Open WebUI tools and functions execute Python code on the server. That is powerful, but it also means you should only install tools from sources you trust.

This is not browser decoration. This is server-side code execution wearing a friendly UI button.

Configuring the Web Search and Crawl Tool

Once the tool is imported, configure its valves.

For the simple setup in this post, the intended path is:

Open WebUI native search
  → Web Search and Crawl tool
  → Crawl4AI

That means the tool should use Open WebUI’s native web search to find URLs, then send those URLs to Crawl4AI for extraction.

The important valve settings are:

USE_NATIVE_SEARCH = true
SEARCH_WITH_SEARXNG = false
CRAWL4AI_BASE_URL = http://crawl4ai:11235
DEBUG = true
MORE_STATUS = true

This deliberately keeps SearXNG out of the stack.

SearXNG is useful in some setups, but it is not required here. The first working version of this stack uses Open WebUI’s native search as the search layer and Crawl4AI as the crawler/extractor layer.

The resulting flow is:

User asks a web question
  → model calls Web Search and Crawl
  → tool asks Open WebUI native search for URLs
  → tool sends URLs to Crawl4AI
  → Crawl4AI extracts readable page content
  → model answers from extracted context

Crawl4AI itself is not the search engine. It only crawls URLs handed to it.

Attaching the Tool to the Model Profile

Importing a tool into Workspace does not automatically mean every model can use it.

The model profile must be configured to include that tool.

In Open WebUI, the flow is roughly:

Workspace
  → Models
  → select your model profile
  → Tools
  → select Web Search and Crawl
  → Save & Update

This distinction matters. A tool can exist in the Workspace and still not be available to a specific model profile.

If the model is not calling the tool, check both places:

Workspace → Tools
Model profile → Tools

Some Open WebUI versions also let you enable tools from the chat input area using a plus button or tool selector. If a tool exists but is not being called, check both the model profile and the active chat-level tool selection.

Function Calling Must Be Native

The key model setting is:

Function Calling: Native

Set the model profile to:

Function Calling = Native

This matters because Open WebUI’s modern tool behavior depends on native function calling.

Without Native function calling, the model may still chat normally, but the richer tool behavior may not work.

Native function calling allows the model to make structured tool calls and decide when a tool is needed.

In practice:

Default / legacy behavior = may not call tools reliably
Native function calling = tool-capable model behavior

So if web search, memory, or chat-history tools do not seem to work, check this first.

Enabling Built-in Tools: Memory, Chat History, Notes, and Knowledge Base

The web crawling tool is only one part of the experience.

The surprising part was cross-thread recall: the model could answer questions about earlier Open WebUI conversations.

That did not come from Crawl4AI.

It came from Open WebUI’s built-in tools.

In the model profile, enable the relevant built-in tools:

Memory
Chat History
Notes
Knowledge Base

This is what allows the model to search previous Open WebUI threads.

The rough flow is:

User asks about an earlier thread
  → model calls Chat History search
  → Open WebUI retrieves matching prior conversations
  → model summarizes the relevant results

That is different from ordinary model memory.

The model is not remembering every previous chat in its weights. It is using Open WebUI’s stored conversation history as a searchable local archive.

What “Memory Across Threads” Actually Means

It is tempting to say the model “remembers” previous threads.

A more accurate description is:

The model can retrieve previous Open WebUI threads when it has tool access.

That distinction matters.

The architecture is:

Open WebUI database
  → stored chats
  → memories
  → notes
  → knowledge bases
  → retrieval tools
  → current model context

The model is not omniscient.

It can only retrieve what Open WebUI has stored and what its enabled tools allow it to search.

That is actually a better design. It means recall can be scoped, inspectable, and local, rather than pretending the model has infinite magical context.

Verifying Cross-Thread Recall

A simple test is to ask the model about a known previous Open WebUI thread:

Search chat history for “AI Realism Analysis” and summarize what we discussed.

A successful response should show that the model used tools such as:

search_chats
view_chat
search_memories
query_knowledge_files

The exact tool names may vary by Open WebUI version and configuration, but the behavior is the same: the model searches local Open WebUI history, retrieves relevant content, and summarizes it.

This is the moment the system starts to feel less like a disposable chatbot and more like a local assistant with bridge logs.

Again, not magic.

Just retrieval.

But useful retrieval.

Open WebUI’s Native RAG Layer

Open WebUI also has its own native RAG pipeline for knowledge bases, uploaded files, notes, and related context.

Conceptually, that looks like this:

Files / knowledge / notes / chat history
  → text extraction and chunking
  → embeddings
  → vector retrieval
  → prompt context
  → model answer

Depending on configuration, Open WebUI can use its default local vector storage or an external vector database.

For a solo local setup, the default path is often enough. If the system grows into a larger personal knowledge base with manuals, boat logs, code notes, essays, and long-running project history, then it may be worth moving the vector layer into a more explicit service boundary.

But the key point is that Open WebUI already provides more than a chat box. It can act as a local retrieval layer over your own material.

Debugging Checklist

If the model disappears from Open WebUI, check the model endpoint:

curl -k https://your-local-model-domain.example/v1/models

If that works from the host, test from inside Open WebUI:

docker exec -it open-webui sh -lc \
  'curl -k https://your-local-model-domain.example/v1/models'

If Crawl4AI does not work, test it from inside Open WebUI:

docker exec -it open-webui sh -lc \
  'curl -sS http://crawl4ai:11235 | head'

If that fails with DNS errors, Open WebUI and Crawl4AI are not on the same Docker network.

If the model is slow to appear after startup, the local model may still be loading. Watch the llama.cpp logs:

docker logs --tail=100 -f gemma-local

If web search returns nothing and Crawl4AI stays quiet, native search probably did not return URLs or the web tool did not invoke crawling.

Watch logs:

docker compose -f docker-compose.31b.native-search-crawl4ai.yml logs --tail=100 -f open-webui crawl4ai gemma

Setup Checklist

At a high level, the working setup requires:

Docker:
  [ ] llama.cpp model server running
  [ ] Open WebUI running
  [ ] Crawl4AI running
  [ ] all services on the same Docker network
  [ ] Open WebUI volume preserved

Open WebUI Workspace:
  [ ] Web Search and Crawl tool imported
  [ ] Web Search and Crawl tool saved under Workspace → Tools
  [ ] tool valves configured for native search + Crawl4AI

Model Profile:
  [ ] Function Calling set to Native
  [ ] Web Search and Crawl attached under Tools
  [ ] Memory enabled
  [ ] Chat History enabled
  [ ] Notes enabled if desired
  [ ] Knowledge Base enabled if desired

Verification:
  [ ] model endpoint returns /v1/models
  [ ] Open WebUI can reach the model endpoint
  [ ] Open WebUI can reach http://crawl4ai:11235
  [ ] model can call web search/crawl
  [ ] model can search prior Open WebUI threads

The stack only feels seamless after all three layers are configured: containers, tools, and model profile.

If any one of those is missing, the system may still boot, but the assistant will behave like it has no eyes, no memory, or no hands.

Local AI is very dignified like that.

What This Setup Gives You

This stack gives you a local AI interface with several useful layers:

Local model inference
Open WebUI chat interface
HTTPS model routing through Caddy or another reverse proxy
Native Open WebUI web search
Crawl4AI web extraction
Open WebUI memory
Open WebUI chat-history recall
Open WebUI knowledge/file RAG

That combination is enough to make the assistant feel meaningfully continuous without handing everything to a cloud service.

It can answer from the current chat, retrieve earlier Open WebUI conversations, remember stable facts, search the web, crawl pages, and reason over the extracted content.

That is the interesting part. The model is not just generating text from a frozen prompt. It is operating over a local archive, local tools, and local infrastructure.

Privacy and Scope

This setup is local-first, but “local-first” does not automatically mean “private by default.”

A few cautions:

Do not publish real API keys.
Do not publish private hostnames unless intentional.
Do not mount sensitive folders casually.
Do not expose Open WebUI or model APIs publicly without authentication.
Do not assume web search providers are private.
Do not assume every tool call is harmless.

The model can become more useful when it can search memory, chat history, files, and the web. But the same capability also means tool scope matters.

A good local assistant should be able to retrieve context without rummaging through everything unnecessarily.

The Bigger Point

A local AI assistant becomes much more useful when it has three things:

A competent local model.
Tool access.
Retrieval over your own history and documents.

The model alone is just a brain in a jar.

Open WebUI gives it a body.

Memory and chat-history retrieval give it continuity.

Crawl4AI gives it eyes on the web.

The result is not perfect. It still needs careful configuration, Docker networking can still be petty, and local models still have limits. But the shape is right.

You do not need a giant cloud system to get something useful. You need a local model, a stable interface, a persistent data volume, and tools that let the model reach beyond the current prompt.

Once those pieces are wired together, the experience changes.

It stops feeling like a disposable chatbot and starts feeling like a local system you can build a working relationship with.

Not magic.

Just plumbing.

But good plumbing is underrated.