```yaml
version: '2.4'

services:
  ollama:
    image: ollama/ollama:0.5.4
    container_name: llm-server
    mem_limit: 8G
    cpus: 5
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_THREAD=5
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_NUM_CTX=512
      - OLLAMA_NUM_PREDICT=128
      - OLLAMA_KEEP_ALIVE=10m
    networks:
      - llm-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:0.4.5
    container_name: open-webui-admin
    mem_limit: 512M
    cpus: 1
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    networks:
      - llm-network
      - web

volumes:
  ollama_data:
  webui_data:

networks:
  llm-network:
    driver: bridge
  web:
    external: true
```
A chill, fully local AI stack:
Clean, deterministic, and future-facing.
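If you want a quick sanity check before starting anything, `docker compose config` renders and validates the file without touching any containers:

```bash
docker compose config --quiet
```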
Ollama is published on host port 11434. The `web` network is external and must already exist. Create it if missing:

```bash
docker network create web
```
⚠️ If you pin a subnet and gateway on `llm-network` (the compose file above leaves Docker's defaults), use valid private IP ranges (example: 172.20.0.0/24).
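A minimal sketch of what that could look like, assuming the 172.20.0.0/24 example range (pick one that does not collide with your existing Docker networks):

```yaml
networks:
  llm-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
          gateway: 172.20.0.1
```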
Start the stack:

```bash
docker compose up -d
```

Check status:

```bash
docker compose ps
```

Logs:

```bash
docker compose logs -f ollama
docker compose logs -f open-webui
```

Stop (keep data):

```bash
docker compose down
```

Stop + wipe everything (⚠️ models included):

```bash
docker compose down -v
```
From the host, the Ollama API is reachable at `http://localhost:11434`.
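A quick way to confirm it is up (the `/api/version` endpoint just reports the server version):

```bash
curl http://localhost:11434/api/version
```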
Open WebUI is not exposed via `ports:`. It is expected to be published through your reverse proxy on the external `web` network.
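If that reverse proxy happens to be Traefik, labels along these lines on the `open-webui` service are one possible sketch (the hostname is hypothetical; Open WebUI listens on port 8080 inside its container):

```yaml
    labels:
      - traefik.enable=true
      - traefik.docker.network=web
      - "traefik.http.routers.openwebui.rule=Host(`ai.example.com`)"
      - traefik.http.services.openwebui.loadbalancer.server.port=8080
```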
All model management commands run inside the Ollama container:

```bash
docker compose exec ollama ollama list            # list installed models
docker compose exec ollama ollama pull llama3.2   # download a model
docker compose exec ollama ollama run llama3.2    # interactive session
docker compose exec ollama ollama ps              # show loaded models
docker compose exec ollama ollama stop llama3.2   # unload a model from RAM
docker compose exec ollama ollama rm llama3.2     # delete a model from disk
```
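The same models are also reachable over the HTTP API; a minimal chat request could look like this (non-streaming, and it assumes `llama3.2` has already been pulled):

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello!" }],
  "stream": false
}'
```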
- `OLLAMA_HOST=0.0.0.0`: bind Ollama to all interfaces.
- `OLLAMA_MAX_LOADED_MODELS=1`: max number of models kept in memory. Lower = predictable RAM, higher = faster model switching.
- `OLLAMA_NUM_PARALLEL=1`: concurrent inference requests per model.
- `OLLAMA_KEEP_ALIVE=10m`: how long a model stays warm in RAM 🔥
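Keep-alive can also be overridden per request via the API's `keep_alive` field; a sketch for a one-off call (a value of 0 asks Ollama to unload the model right after responding):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "One-off question",
  "keep_alive": 0
}'
```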
These three:

- `OLLAMA_NUM_THREAD`
- `OLLAMA_NUM_CTX`
- `OLLAMA_NUM_PREDICT`

are runtime/model parameters, not reliable server env vars.

Best practice: set them per request through the API's `options` field (or bake them into a Modelfile, sketched below). Example:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain KV cache",
  "options": {
    "num_ctx": 4096
  }
}'
```
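If you would rather bake such defaults into a model, the same knobs exist as Modelfile `PARAMETER`s. A sketch, where `llama3.2-tuned` is a made-up name and the values are only illustrative:

```bash
# Write a Modelfile deriving a tuned variant from llama3.2
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 4096
PARAMETER num_predict 256
PARAMETER num_thread 5
EOF

# Copy it into the container and build the derived model
docker cp Modelfile llm-server:/tmp/Modelfile
docker compose exec ollama ollama create llama3.2-tuned -f /tmp/Modelfile
```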
Global default context: `OLLAMA_CONTEXT_LENGTH=4096`
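In this compose file it would sit alongside the other Ollama env vars; whether it is honoured depends on the Ollama release you pin, so verify against the version you actually run:

```yaml
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_CONTEXT_LENGTH=4096
```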
- `OLLAMA_BASE_URL=http://ollama:11434`: point Open WebUI at the Ollama service over the internal `llm-network`.
- `WEBUI_AUTH=true`: enable authentication ✅
- `ENABLE_SIGNUP=false`: disable public registration 🔒
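One optional extra, offered as an assumption rather than a requirement of this stack: Open WebUI signs session tokens with a secret key, and `WEBUI_SECRET_KEY` lets you pin it explicitly instead of relying on the auto-generated one stored in the data volume:

```yaml
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - WEBUI_SECRET_KEY=change-me-to-a-long-random-string
```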
The result:

- 🧊 Stable
- 🎯 Predictable
- 🔥 Low latency for repeat prompts

To scale: raise `OLLAMA_MAX_LOADED_MODELS`, `OLLAMA_NUM_PARALLEL`, and the container's `mem_limit`/`cpus` together, as sketched below.
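A rough sketch of what that could look like (values are illustrative; keep `mem_limit` in step with how many models you actually load):

```yaml
  ollama:
    mem_limit: 16G
    cpus: 8
    environment:
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_NUM_PARALLEL=2
```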
Volumes hold all persistent state: `ollama_data` (models) and `webui_data` (users, chats, settings).

Backup example (note that Compose usually prefixes volume names with the project name; check `docker volume ls` and adjust the `-v` names if needed):

```bash
docker run --rm -v ollama_data:/data -v "$PWD":/backup alpine \
  tar -czf /backup/ollama_data.tgz -C /data .

docker run --rm -v webui_data:/data -v "$PWD":/backup alpine \
  tar -czf /backup/webui_data.tgz -C /data .
```
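Restoring is the same idea in reverse (run it with the stack stopped so nothing writes to the volumes mid-restore):

```bash
docker run --rm -v ollama_data:/data -v "$PWD":/backup alpine \
  tar -xzf /backup/ollama_data.tgz -C /data

docker run --rm -v webui_data:/data -v "$PWD":/backup alpine \
  tar -xzf /backup/webui_data.tgz -C /data
```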
🧠 Local AI.
🐳 Containerized.
🚀 Ready for tomorrow.