πŸš€ Self-Hosted LLM Stack (Ollama + Open WebUI)

services:
  ollama:
    image: ollama/ollama:0.5.4
    container_name: llm-server
    mem_limit: 8G
    cpus: 5
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_THREAD=5
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_NUM_CTX=512
      - OLLAMA_NUM_PREDICT=128
      - OLLAMA_KEEP_ALIVE=10m
    networks:
      - llm-network

  open-webui:
    image: ghcr.io/open-webui/open-webui:0.4.5
    container_name: open-webui-admin
    mem_limit: 512M
    cpus: 1
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    networks:
      - llm-network
      - web

volumes:
  ollama_data:
  webui_data:

networks:
  llm-network:
    driver: bridge
  web:
    external: true

A chill, fully local AI stack:

Clean, deterministic, and future-facing.


πŸ“¦ What’s inside

Ollama: the LLM inference server, pinned to 0.5.4, exposed on port 11434
Open WebUI: a web front end for Ollama, pinned to 0.4.5, with auth on and signup off

βœ… Prerequisites

The stack joins an external Docker network named web (typically shared with your reverse proxy). Create it if missing:

docker network create web
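
To check whether it already exists:

docker network ls --filter name=web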

⚠️ llm-network above uses Docker's default addressing. If you pin a subnet and gateway on it, pick a valid private range
(example: 172.20.0.0/24), as sketched below.
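
If you do want a fixed subnet, a minimal sketch (values are examples; adjust to your environment):

networks:
  llm-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
          gateway: 172.20.0.1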


▢️ Start the stack

docker compose up -d

Check status:

docker compose ps

Logs:

docker compose logs -f ollama
docker compose logs -f open-webui

Stop (keep data):

docker compose down

Stop + wipe everything (⚠️ models included):

docker compose down -v

🌐 Access

Ollama API

From host:

http://localhost:11434
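
Quick smoke test (the /api/tags endpoint lists installed models):

curl http://localhost:11434/api/tags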

Open WebUI

No host port is published.
Open WebUI is expected to be exposed through your reverse proxy on the external web network; see the sketch below.
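
For instance, with Traefik attached to web, labels along these lines would publish it (the hostname and router name are placeholders; Open WebUI listens on 8080 inside the container):

  open-webui:
    labels:
      - traefik.enable=true
      - "traefik.http.routers.open-webui.rule=Host(`chat.example.com`)"
      - traefik.http.services.open-webui.loadbalancer.server.port=8080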


🧩 Model management (inside Docker)

All commands run inside the Ollama container.


πŸ“‹ List models

docker compose exec ollama ollama list

⬇️ Install a model

docker compose exec ollama ollama pull llama3.2
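
Tags select a specific build, e.g. the smaller 1B variant (assuming the tag exists upstream):

docker compose exec ollama ollama pull llama3.2:1b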

▢️ Run a model

docker compose exec ollama ollama run llama3.2
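
Or talk to it over HTTP instead of the interactive CLI:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Hello!" }]
}'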

πŸ‘€ Loaded models

docker compose exec ollama ollama ps

πŸ›‘ Unload model from RAM

docker compose exec ollama ollama stop llama3.2

πŸ—‘οΈ Delete model from disk

docker compose exec ollama ollama rm llama3.2

βš™οΈ Environment variables

🧠 Ollama


OLLAMA_HOST

OLLAMA_HOST=0.0.0.0

Bind Ollama to all interfaces.


OLLAMA_MAX_LOADED_MODELS

OLLAMA_MAX_LOADED_MODELS=1

Max number of models kept in memory.

Lower = predictable RAM
Higher = faster model switching


OLLAMA_NUM_PARALLEL

OLLAMA_NUM_PARALLEL=1

Concurrent inference requests per model.


OLLAMA_KEEP_ALIVE

OLLAMA_KEEP_ALIVE=10m

How long a model stays warm in RAM πŸ”₯
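
A per-request keep_alive field overrides this; 0 unloads the model right after the response:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hi",
  "keep_alive": 0
}'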


⚠️ Important

These:

OLLAMA_NUM_THREAD
OLLAMA_NUM_CTX
OLLAMA_NUM_PREDICT

are runtime/model parameters, not reliable server env vars; set in the environment they may simply be ignored.

Best practice: pass them per request in the options object, or bake them into a derived model via a Modelfile (see below).

Example:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain KV cache",
  "options": {
    "num_ctx": 4096
  }
}'
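
Or bake the defaults into a derived model (llama3.2-4k is just an illustrative name). Modelfile:

FROM llama3.2
PARAMETER num_ctx 4096
PARAMETER num_predict 128

Copy it in and create the model:

docker compose cp Modelfile ollama:/tmp/Modelfile
docker compose exec ollama ollama create llama3.2-4k -f /tmp/Modelfile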

Recent Ollama releases also accept a global default context length (verify your version supports it before relying on it):

OLLAMA_CONTEXT_LENGTH=4096

πŸ–₯️ Open WebUI

OLLAMA_BASE_URL

OLLAMA_BASE_URL=http://ollama:11434

WEBUI_AUTH

WEBUI_AUTH=true

Enable authentication βœ…


ENABLE_SIGNUP

ENABLE_SIGNUP=false

Disable public registration πŸ”


πŸ“ Capacity model (current setup)

One model in RAM at a time (OLLAMA_MAX_LOADED_MODELS=1), one request at a time (OLLAMA_NUM_PARALLEL=1), kept warm for 10 minutes, all inside an 8G / 5-CPU budget.

Result:

🧘 Stable
🎯 Predictable
πŸ”₯ Low latency for repeat prompts

To scale: raise OLLAMA_NUM_PARALLEL and/or OLLAMA_MAX_LOADED_MODELS, and grow mem_limit and cpus to match; see the sketch below.
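
An illustrative beefier profile (numbers are placeholders; size them to your hardware):

    mem_limit: 16G
    cpus: 8
    environment:
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_NUM_PARALLEL=2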


πŸ’Ύ Persistence

Volumes:

ollama_data β†’ /root/.ollama (downloaded models)
webui_data β†’ /app/backend/data (users, chats, settings)

Backup example:

docker run --rm -v ollama_data:/data -v "$PWD":/backup alpine \
tar -czf /backup/ollama_data.tgz -C /data .

docker run --rm -v webui_data:/data -v "$PWD":/backup alpine \
tar -czf /backup/webui_data.tgz -C /data .
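
Restore sketch (run with the stack stopped; assumes the archives are in the current directory):

docker run --rm -v ollama_data:/data -v "$PWD":/backup alpine \
tar -xzf /backup/ollama_data.tgz -C /data

docker run --rm -v webui_data:/data -v "$PWD":/backup alpine \
tar -xzf /backup/webui_data.tgz -C /data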

🧠 Local AI.
🐳 Containerized.
πŸš€ Ready for tomorrow.