Building a Local AI That Can Search the Web and Remember Across Threads
I have been experimenting with a local AI setup that feels less like a toy chatbot and more like a working systems aide: a local model running through
llama.cpp, Open WebUI as the interface, Crawl4AI for web extraction, and Open WebUI’s own memory and chat-history tools for implicit recall across conversations.
The result is surprisingly usable. The model can answer normal chat questions, search the web when needed, crawl and summarize pages, and retrieve useful context from previous Open WebUI threads.
It is not magic.
It is plumbing.
But good plumbing is underrated.
What This Stack Does
The goal was to run a local AI that could do four things:
- Run a capable local model.
- Use Open WebUI as the chat interface.
- Search and crawl the web from inside a conversation.
- Recall useful context from previous Open WebUI threads.
The core stack looks like this:
Open WebUI → llama.cpp model server → Open WebUI native web search → Crawl4AI for web page extraction → Open WebUI Memory / Chat History / Knowledge Base tools
The important distinction is that web search, web crawling, and cross-thread memory are separate capabilities.
Native Web Search = finds candidate web results Crawl4AI = crawls and extracts readable content from URLs Chat History = searches previous Open WebUI conversations Memory = stores and retrieves durable user/project context Knowledge Base = RAG over uploaded files and knowledge collections
Crawl4AI does not make the AI remember previous chats. It handles external web pages.
The cross-thread recall comes from Open WebUI’s built-in tools, especially Chat History and Memory, when function calling is enabled.
Architecture: Services and Data Flow
There are three main runtime services:
gemma = local llama.cpp model server open-webui = chat interface and tool orchestrator crawl4ai = local web crawler / extractor
The runtime data flow looks like this:
User → Open WebUI → local model server via OpenAI-compatible API → built-in Open WebUI tools → external Web Search and Crawl tool → Open WebUI native web search → Crawl4AI → target web pages
There are two major retrieval paths.
The first is external web retrieval:
User asks web/current-info question → model calls Web Search and Crawl tool → Open WebUI native search finds URLs → Crawl4AI crawls URLs → extracted content returns to model → model answers with fresh context
The second is local continuity retrieval:
User asks about prior work or remembered context → model calls Open WebUI built-in tools → Chat History / Memory / Notes / Knowledge Base retrieve local context → model answers using retrieved prior material
That second path is what makes the assistant feel like it can remember across threads.
It is not the model magically having infinite context. It is retrieval over Open WebUI’s stored data.
That distinction matters.
The Model Server
The local model is served through llama.cpp using the CUDA server image.
In this setup, the model server exposes an OpenAI-compatible API on port 8080.
A local HTTPS proxy such as Caddy can sit in front of the model server, allowing Open WebUI to talk to the local model through a familiar OpenAI-compatible endpoint:
https://your-local-model-domain.example/v1
A quick health check looks like this:
curl -k https://your-local-model-domain.example/v1/models
If that returns a model list, the model server and HTTPS route are alive.
If you do not use Caddy or HTTPS locally, you can point Open WebUI directly at the model server instead:
http://host.docker.internal:8080/v1
or, depending on your Docker network:
http://gemma:8080/v1
The exact endpoint depends on whether Open WebUI reaches the model through the host, a reverse proxy, or the internal Docker network.
Why Docker Compose Matters
Originally, I started containers manually with docker run.
That worked until Open WebUI needed to call Crawl4AI by container name:
http://crawl4ai:11235
The problem was Docker networking.
A container name like crawl4ai is only resolvable from another container if both containers share the same Docker network.
When containers are started one by one, they may not land on the same user-defined bridge network. So Open WebUI may not be able to resolve crawl4ai, even if Crawl4AI is running perfectly from the host.
Docker Compose fixes this by declaring the stack as one unit. Compose creates a shared network and attaches all services to it. Then service names become internal DNS names.
So this works:
open-webui → http://crawl4ai:11235
because both services are on the same Compose network.
The lesson is simple: if containers need to talk to each other by name, put them in the same Compose file or attach them to the same user-defined Docker network.
Example Docker Compose File
Here is the basic Compose shape for this setup.
This example runs:
gemma = local llama.cpp model server open-webui = Open WebUI interface crawl4ai = local web crawler / extractor
It deliberately does not include SearXNG.
The search path here assumes Open WebUI native web search is enabled, and Crawl4AI is only used to crawl and extract the URLs returned by native search.
services: gemma: image: ghcr.io/ggml-org/llama.cpp:server-cuda container_name: gemma-local restart: unless-stopped gpus: all ports: - "127.0.0.1:8080:8080" env_file: - .env volumes: - ./models:/models:ro command: > -m /models/gemma4-31b/google_gemma-4-31B-it-Q4_K_S.gguf --mmproj /models/gemma4-31b/mmproj-google_gemma-4-31B-it-bf16.gguf --host 0.0.0.0 --port 8080 --api-key ${LLM_API_KEY} --ctx-size 28000 --parallel 1 --cache-type-k q4_0 --cache-type-v q4_0 --fit on --fit-target 2048,256 --batch-size 512 --ubatch-size 312 networks: - local-ai open-webui: image: ghcr.io/open-webui/open-webui:v0.9.1 container_name: open-webui restart: unless-stopped ports: - "127.0.0.1:3000:8080" volumes: - open-webui_open-webui:/app/backend/data depends_on: - gemma - crawl4ai networks: - local-ai crawl4ai: image: unclecode/crawl4ai:latest container_name: crawl4ai restart: unless-stopped shm_size: "1g" ports: - "127.0.0.1:11235:11235" networks: - local-ai volumes: open-webui_open-webui: external: true networks: local-ai: name: local-ai
The important detail is the shared Docker network:
networks: local-ai: name: local-ai
Each service joins that network:
networks: - local-ai
That is what allows Open WebUI to call Crawl4AI by service name:
http://crawl4ai:11235
Without the shared network, crawl4ai may be running perfectly, but Open WebUI will not be able to resolve the hostname from inside its own container.
Docker networking, as usual, works beautifully once you already know the answer.
Compose Flag and Setting Breakdown
The Compose file is small, but several lines matter.
image: ghcr.io/ggml-org/llama.cpp:server-cuda
This uses the CUDA-enabled llama.cpp server image. It lets the local model run with GPU acceleration.
container_name: gemma-local
This gives the model server a stable container name. That makes logs and debugging easier:
docker logs -f gemma-local
restart: unless-stopped
This tells Docker to restart the container after crashes or reboots, unless you explicitly stop it.
gpus: all
This gives the container access to available NVIDIA GPUs. For a single-GPU setup, this is usually enough. For multi-GPU tuning, advanced users can add explicit split settings later, but this example keeps the default behavior to avoid baking in hardware-specific assumptions.
ports: - "127.0.0.1:8080:8080"
This exposes the model server only on localhost, not on the public network.
The format is:
host_ip:host_port:container_port
So this means:
127.0.0.1:8080 on the host → port 8080 inside the container
Binding to 127.0.0.1 is safer than binding to 0.0.0.0, because it avoids exposing the service to the LAN or public network by accident.
env_file: - .env
This loads environment variables from a local .env file.
In this setup, that file contains the local model server API key:
LLM_API_KEY=replace-this-with-your-local-key
Do not publish real API keys.
volumes: - ./models:/models:ro
This mounts the local ./models directory into the container at /models.
The :ro suffix means read-only. The model server can read model files but cannot modify them.
command: >
This passes startup arguments to the llama.cpp server.
Important model-server flags:
-m /models/...gguf
Path to the main GGUF model file.
--mmproj /models/...gguf
Path to the multimodal projection file, if using a multimodal model.
--host 0.0.0.0
Makes the server listen on all interfaces inside the container.
This is necessary because the container port must be reachable from outside the container. The host-level port binding still restricts exposure to 127.0.0.1.
--port 8080
Runs the model API on port 8080.
--api-key ${LLM_API_KEY}
Protects the model endpoint with an API key from the .env file.
--ctx-size 28000
Sets the context window used by the server.
Larger context allows longer prompts and retrieved material, but uses more VRAM/RAM.
--parallel 1
Controls how many parallel request slots the server handles.
For a large local model on limited VRAM, 1 is often safer.
--cache-type-k q4_0 --cache-type-v q4_0
Quantizes the KV cache to reduce memory usage.
This helps fit larger context windows on available hardware, at some quality/performance tradeoff.
--fit on --fit-target 2048,256
Model/server-specific fitting behavior. In practice, this helps the server manage prompt/image or multimodal fit behavior depending on the build and model support.
--batch-size 512 --ubatch-size 312
Controls batching behavior.
These affect throughput and memory use. Larger values can improve speed but may consume more VRAM.
This example intentionally does not include:
--split-mode layer --tensor-split ...
Those flags are useful for multi-GPU tuning, but they are hardware-specific. Leaving them out keeps the example portable and avoids implying a particular GPU arrangement.
For Open WebUI:
volumes: - open-webui_open-webui:/app/backend/data
This preserves Open WebUI settings, model profiles, memory, chat history, knowledge, and uploaded data.
For Crawl4AI:
shm_size: "1g"
This gives Crawl4AI more shared memory, which is useful because browser-based crawling can otherwise run into shared-memory limits.
ports: - "127.0.0.1:11235:11235"
This exposes Crawl4AI on localhost for host testing, while still allowing Open WebUI to call it internally by service name:
http://crawl4ai:11235
Preserving Open WebUI Settings
The Open WebUI data volume is critical:
volumes: - open-webui_open-webui:/app/backend/data
In my setup, this reused an existing named volume.
That preserved Open WebUI settings, model configuration, memory, chat history, and stored conversations.
If you are setting this up from scratch, you can use a normal named volume instead:
volumes: open-webui:
and mount it like this:
volumes: - open-webui:/app/backend/data
But if you already have an Open WebUI install with existing settings, inspect the current container before replacing it:
docker inspect open-webui --format '{{json .Mounts}}' | jq
Look for the volume mounted at:
/app/backend/data
That is the volume you want to preserve.
The rule is simple:
Containers are disposable. Volumes are where the memory lives.
If you accidentally mount a fresh volume, Open WebUI may still run, but it will not have the same settings, chats, memories, or model configuration.
Starting the Stack
From the local AI project directory:
docker stop open-webui gemma-local crawl4ai 2>/dev/null || true docker rm open-webui gemma-local crawl4ai 2>/dev/null || true docker compose -f docker-compose.31b.native-search-crawl4ai.yml up -d
The first command stops any old manually-created containers.
docker stop open-webui gemma-local crawl4ai 2>/dev/null || true
The 2>/dev/null part hides error messages if a container does not exist.
The || true part prevents the command from failing the whole script.
The second command removes the stopped containers:
docker rm open-webui gemma-local crawl4ai 2>/dev/null || true
This is needed because Docker Compose cannot create a container with a name already used by a manually-created container.
The final command starts the Compose stack:
docker compose -f docker-compose.31b.native-search-crawl4ai.yml up -d
Flag breakdown:
-f docker-compose.31b.native-search-crawl4ai.yml
Use this specific Compose file.
up
Create and start the services.
-d
Detached mode. Run in the background.
To watch the logs:
docker compose -f docker-compose.31b.native-search-crawl4ai.yml logs --tail=100 -f
Flag breakdown:
logs
Show service logs.
--tail=100
Show only the last 100 log lines initially.
-f
Follow logs live.
To watch only the model server logs:
docker logs --tail=100 -f gemma-local
To watch resource usage:
docker stats open-webui gemma-local crawl4ai
To watch GPU activity:
watch -n 1 nvidia-smi
Verifying the Model
Check that the model endpoint works from the host:
curl -k https://your-local-model-domain.example/v1/models
If you are not using HTTPS or a reverse proxy, check the local port instead:
curl http://127.0.0.1:8080/v1/models
Then check that Open WebUI can reach the model from inside its container:
docker exec -it open-webui sh -lc \ 'curl -k https://your-local-model-domain.example/v1/models'
Flag breakdown:
docker exec
Run a command inside an existing container.
-it
Interactive terminal mode.
open-webui
The container to run the command inside.
sh -lc
Run the command through a shell.
If the host can reach the model but Open WebUI cannot, the issue is container DNS, routing, certificate trust, or host gateway access.
If both can reach it, the model path is healthy.
Installing Crawl4AI
Crawl4AI runs as a local service. It exposes an HTTP API on port 11235.
The standalone command is:
docker run -d \ --name crawl4ai \ --restart unless-stopped \ --shm-size=1g \ -p 11235:11235 \ unclecode/crawl4ai:latest
Flag breakdown:
-d
Run detached in the background.
--name crawl4ai
Give the container a stable name.
--restart unless-stopped
Restart automatically after reboot or crash unless manually stopped.
--shm-size=1g
Give the container 1 GB of shared memory.
-p 11235:11235
Expose container port 11235 on host port 11235.
unclecode/crawl4ai:latest
The Crawl4AI Docker image.
For a real stack, Compose is cleaner because Open WebUI needs to resolve Crawl4AI by service name.
After the stack is up, test Crawl4AI from inside Open WebUI:
docker exec -it open-webui sh -lc \ 'curl -sS http://crawl4ai:11235 | head'
If that works, Open WebUI can reach Crawl4AI over the internal Docker network.
Web Search Versus Web Crawling
This part is easy to confuse.
Open WebUI native search finds web results.
Crawl4AI fetches and extracts page content from URLs.
The intended flow is:
Open WebUI native web search → returns URLs → Web Search and Crawl tool sends URLs to Crawl4AI → Crawl4AI extracts content → model reads the extracted content
Crawl4AI is not a search engine. It is a crawler and extractor.
I briefly considered adding SearXNG, which is a self-hosted metasearch engine. But that is an optional extra search layer, not required for this setup.
For now, the simpler path is:
Open WebUI native search → Crawl4AI
That avoids running an extra metasearch service unless native search proves unreliable.
Setting Up the Open WebUI Side
The Docker stack gets the services running, but Open WebUI still needs to know which tools the model is allowed to use.
This part is easy to miss because the infrastructure can be healthy while the model still behaves like it has no tools. The containers may all be running, Crawl4AI may be reachable, and the model may be visible, but if the model profile is not configured correctly, nothing agentic happens.
There are three separate Open WebUI pieces to configure:
1. Import the Web Search and Crawl tool. 2. Attach that tool to the model profile. 3. Enable native function calling and built-in tools such as Memory and Chat History.
These are related, but not the same.
Importing the Web Search and Crawl Tool
The external web crawling capability comes from an Open WebUI tool plugin called Web Search and Crawl.
The tool I used is titled:
Web Search and Crawl
Its description says it can search and crawl the web using SearXNG, Open WebUI native search, and Crawl4AI. It extracts content from URLs using a self-hosted Crawl4AI instance, with optional deeper research behavior.
In Open WebUI, the rough flow is:
Workspace → Tools → Add / Import tool → paste or import the Web Search and Crawl script → Save
Depending on your Open WebUI version, the tool may come from the community tool store / tool marketplace flow rather than being pasted manually. The community flow usually looks like:
Find the tool → View / Get → Import to WebUI → Save under Workspace → Tools
The important part is that the tool must end up saved in Open WebUI’s Workspace → Tools area before a model can use it.
Security note: Open WebUI tools and functions execute Python code on the server. That is powerful, but it also means you should only install tools from sources you trust.
This is not browser decoration. This is server-side code execution wearing a friendly UI button.
Configuring the Web Search and Crawl Tool
Once the tool is imported, configure its valves.
For the simple setup in this post, the intended path is:
Open WebUI native search → Web Search and Crawl tool → Crawl4AI
That means the tool should use Open WebUI’s native web search to find URLs, then send those URLs to Crawl4AI for extraction.
The important valve settings are:
USE_NATIVE_SEARCH = true SEARCH_WITH_SEARXNG = false CRAWL4AI_BASE_URL = http://crawl4ai:11235 DEBUG = true MORE_STATUS = true
This deliberately keeps SearXNG out of the stack.
SearXNG is useful in some setups, but it is not required here. The first working version of this stack uses Open WebUI’s native search as the search layer and Crawl4AI as the crawler/extractor layer.
The resulting flow is:
User asks a web question → model calls Web Search and Crawl → tool asks Open WebUI native search for URLs → tool sends URLs to Crawl4AI → Crawl4AI extracts readable page content → model answers from extracted context
Crawl4AI itself is not the search engine. It only crawls URLs handed to it.
Attaching the Tool to the Model Profile
Importing a tool into Workspace does not automatically mean every model can use it.
The model profile must be configured to include that tool.
In Open WebUI, the flow is roughly:
Workspace → Models → select your model profile → Tools → select Web Search and Crawl → Save & Update
This distinction matters. A tool can exist in the Workspace and still not be available to a specific model profile.
If the model is not calling the tool, check both places:
Workspace → Tools Model profile → Tools
Some Open WebUI versions also let you enable tools from the chat input area using a plus button or tool selector. If a tool exists but is not being called, check both the model profile and the active chat-level tool selection.
Function Calling Must Be Native
The key model setting is:
Function Calling: Native
Set the model profile to:
Function Calling = Native
This matters because Open WebUI’s modern tool behavior depends on native function calling.
Without Native function calling, the model may still chat normally, but the richer tool behavior may not work.
Native function calling allows the model to make structured tool calls and decide when a tool is needed.
In practice:
Default / legacy behavior = may not call tools reliably Native function calling = tool-capable model behavior
So if web search, memory, or chat-history tools do not seem to work, check this first.
Enabling Built-in Tools: Memory, Chat History, Notes, and Knowledge Base
The web crawling tool is only one part of the experience.
The surprising part was cross-thread recall: the model could answer questions about earlier Open WebUI conversations.
That did not come from Crawl4AI.
It came from Open WebUI’s built-in tools.
In the model profile, enable the relevant built-in tools:
Memory Chat History Notes Knowledge Base
This is what allows the model to search previous Open WebUI threads.
The rough flow is:
User asks about an earlier thread → model calls Chat History search → Open WebUI retrieves matching prior conversations → model summarizes the relevant results
That is different from ordinary model memory.
The model is not remembering every previous chat in its weights. It is using Open WebUI’s stored conversation history as a searchable local archive.
What “Memory Across Threads” Actually Means
It is tempting to say the model “remembers” previous threads.
A more accurate description is:
The model can retrieve previous Open WebUI threads when it has tool access.
That distinction matters.
The architecture is:
Open WebUI database → stored chats → memories → notes → knowledge bases → retrieval tools → current model context
The model is not omniscient.
It can only retrieve what Open WebUI has stored and what its enabled tools allow it to search.
That is actually a better design. It means recall can be scoped, inspectable, and local, rather than pretending the model has infinite magical context.
Verifying Cross-Thread Recall
A simple test is to ask the model about a known previous Open WebUI thread:
Search chat history for “AI Realism Analysis” and summarize what we discussed.
A successful response should show that the model used tools such as:
search_chats view_chat search_memories query_knowledge_files
The exact tool names may vary by Open WebUI version and configuration, but the behavior is the same: the model searches local Open WebUI history, retrieves relevant content, and summarizes it.
This is the moment the system starts to feel less like a disposable chatbot and more like a local assistant with bridge logs.
Again, not magic.
Just retrieval.
But useful retrieval.
Open WebUI’s Native RAG Layer
Open WebUI also has its own native RAG pipeline for knowledge bases, uploaded files, notes, and related context.
Conceptually, that looks like this:
Files / knowledge / notes / chat history → text extraction and chunking → embeddings → vector retrieval → prompt context → model answer
Depending on configuration, Open WebUI can use its default local vector storage or an external vector database.
For a solo local setup, the default path is often enough. If the system grows into a larger personal knowledge base with manuals, boat logs, code notes, essays, and long-running project history, then it may be worth moving the vector layer into a more explicit service boundary.
But the key point is that Open WebUI already provides more than a chat box. It can act as a local retrieval layer over your own material.
Debugging Checklist
If the model disappears from Open WebUI, check the model endpoint:
curl -k https://your-local-model-domain.example/v1/models
If that works from the host, test from inside Open WebUI:
docker exec -it open-webui sh -lc \ 'curl -k https://your-local-model-domain.example/v1/models'
If Crawl4AI does not work, test it from inside Open WebUI:
docker exec -it open-webui sh -lc \ 'curl -sS http://crawl4ai:11235 | head'
If that fails with DNS errors, Open WebUI and Crawl4AI are not on the same Docker network.
If the model is slow to appear after startup, the local model may still be loading. Watch the llama.cpp logs:
docker logs --tail=100 -f gemma-local
If web search returns nothing and Crawl4AI stays quiet, native search probably did not return URLs or the web tool did not invoke crawling.
Watch logs:
docker compose -f docker-compose.31b.native-search-crawl4ai.yml logs --tail=100 -f open-webui crawl4ai gemma
Setup Checklist
At a high level, the working setup requires:
Docker: [ ] llama.cpp model server running [ ] Open WebUI running [ ] Crawl4AI running [ ] all services on the same Docker network [ ] Open WebUI volume preserved Open WebUI Workspace: [ ] Web Search and Crawl tool imported [ ] Web Search and Crawl tool saved under Workspace → Tools [ ] tool valves configured for native search + Crawl4AI Model Profile: [ ] Function Calling set to Native [ ] Web Search and Crawl attached under Tools [ ] Memory enabled [ ] Chat History enabled [ ] Notes enabled if desired [ ] Knowledge Base enabled if desired Verification: [ ] model endpoint returns /v1/models [ ] Open WebUI can reach the model endpoint [ ] Open WebUI can reach http://crawl4ai:11235 [ ] model can call web search/crawl [ ] model can search prior Open WebUI threads
The stack only feels seamless after all three layers are configured: containers, tools, and model profile.
If any one of those is missing, the system may still boot, but the assistant will behave like it has no eyes, no memory, or no hands.
Local AI is very dignified like that.
What This Setup Gives You
This stack gives you a local AI interface with several useful layers:
Local model inference Open WebUI chat interface HTTPS model routing through Caddy or another reverse proxy Native Open WebUI web search Crawl4AI web extraction Open WebUI memory Open WebUI chat-history recall Open WebUI knowledge/file RAG
That combination is enough to make the assistant feel meaningfully continuous without handing everything to a cloud service.
It can answer from the current chat, retrieve earlier Open WebUI conversations, remember stable facts, search the web, crawl pages, and reason over the extracted content.
That is the interesting part. The model is not just generating text from a frozen prompt. It is operating over a local archive, local tools, and local infrastructure.
Privacy and Scope
This setup is local-first, but “local-first” does not automatically mean “private by default.”
A few cautions:
Do not publish real API keys. Do not publish private hostnames unless intentional. Do not mount sensitive folders casually. Do not expose Open WebUI or model APIs publicly without authentication. Do not assume web search providers are private. Do not assume every tool call is harmless.
The model can become more useful when it can search memory, chat history, files, and the web. But the same capability also means tool scope matters.
A good local assistant should be able to retrieve context without rummaging through everything unnecessarily.
The Bigger Point
A local AI assistant becomes much more useful when it has three things:
- A competent local model.
- Tool access.
- Retrieval over your own history and documents.
The model alone is just a brain in a jar.
Open WebUI gives it a body.
Memory and chat-history retrieval give it continuity.
Crawl4AI gives it eyes on the web.
The result is not perfect. It still needs careful configuration, Docker networking can still be petty, and local models still have limits. But the shape is right.
You do not need a giant cloud system to get something useful. You need a local model, a stable interface, a persistent data volume, and tools that let the model reach beyond the current prompt.
Once those pieces are wired together, the experience changes.
It stops feeling like a disposable chatbot and starts feeling like a local system you can build a working relationship with.
Not magic.
Just plumbing.
But good plumbing is underrated.