Set up Ollama for free local embeddings and LLM — no API keys, no rate limits

Ollama (Local Embeddings & LLM)

Ollama provides free, unlimited local embeddings for memory search — no API keys, no rate limits, no cloud dependency.

Why Ollama?

| Problem | Solution |
|---|---|
| node-llama-cpp crashes on Apple Silicon (Metal GPU bug) | Ollama handles Metal natively |
| Gemini/OpenAI free tier hits rate limits (429) | Local = zero API calls |
| Memory search fails when quota exhausted | Always available offline |

Supported Platforms

| Platform | Status | Notes |
|---|---|---|
| macOS (Apple Silicon M1–M4) | ✅ Tested | Metal GPU acceleration |
| macOS (Intel) | ✅ Works | CPU only, slower |
| Ubuntu / Debian | ✅ Works | NVIDIA GPU optional (CUDA auto-detected) |
| WSL2 (Windows) | ✅ Works | GPU passthrough with NVIDIA |
| Windows native | ✅ Works | Direct install from ollama.com |

Install

macOS

# Install
brew install ollama

# Start as background service (auto-starts on login)
brew services start ollama

# Pull embedding model (274 MB)
ollama pull nomic-embed-text

# Optional: chat model for local tasks (2 GB)
ollama pull llama3.2:3b

Ubuntu / Debian

# Install (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Ollama starts automatically via systemd
systemctl status ollama

# Pull models
ollama pull nomic-embed-text
ollama pull llama3.2:3b   # optional

NVIDIA GPU (optional): Ollama auto-detects CUDA if drivers are installed (nvidia-smi should work).

WSL2

# Option A: Install inside WSL2 (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Option B: Install on Windows natively (download from ollama.com)
# Then access from WSL2 at http://host.docker.internal:11434

Verify

ollama list
curl -s http://localhost:11434/v1/models
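
To confirm embeddings work end to end, you can also hit the OpenAI-compatible embeddings endpoint directly. This assumes Ollama is running locally and nomic-embed-text has been pulled; a successful response should include an embedding vector under data:

```shell
# Request a single embedding through the OpenAI-compatible endpoint
# (the same API shape Zirkabot uses). Falls through to a message if
# the server is not running.
curl -s http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "input": "hello world"}' \
  || echo "Ollama is not reachable on :11434"
```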

Zirkabot Configuration

Set memorySearch in your zirkabot.json to use Ollama's OpenAI-compatible API:

{
  "agents": {
    "defaults": {
      "memorySearch": {
        "provider": "openai",
        "model": "nomic-embed-text",
        "remote": {
          "apiKey": "ollama",
          "baseUrl": "http://localhost:11434/v1"
        },
        "fallback": "gemini",
        "query": {
          "hybrid": {
            "enabled": true,
            "vectorWeight": 0.7,
            "textWeight": 0.3
          }
        },
        "experimental": { "sessionMemory": true },
        "sources": ["memory", "sessions"],
        "cache": {
          "enabled": true
        }
      }
    }
  }
}
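
The hybrid block blends semantic (vector) and keyword (text) match scores. As a rough sketch of how the configured weights combine — assuming both scores are normalized to 0–1; Zirkabot's actual fusion logic may differ — a result's combined score looks like this:

```shell
# Hypothetical normalized scores for one search result (illustrative values).
vector_score=0.82   # semantic similarity from the embedding model
text_score=0.40     # keyword-style text match
# Weighted blend using the configured vectorWeight (0.7) and textWeight (0.3).
awk -v v="$vector_score" -v t="$text_score" \
    'BEGIN { printf "%.3f\n", 0.7 * v + 0.3 * t }'
```

Raising vectorWeight favors semantically similar memories; raising textWeight favors exact keyword hits.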

| Field | Value | Why |
|---|---|---|
| provider | "openai" | Ollama exposes an OpenAI-compatible API |
| remote.baseUrl | http://localhost:11434/v1 | Ollama's local endpoint |
| remote.apiKey | "ollama" | Required by provider, ignored by Ollama |
| model | "nomic-embed-text" | 768-dim embeddings, fast, good quality |
| fallback | "gemini" | Optional cloud fallback if Ollama is down |
| experimental.sessionMemory | true | Index past conversations for search |
| sources | ["memory", "sessions"] | Search both memory files and session transcripts |

For WSL2 accessing Windows-hosted Ollama, use http://host.docker.internal:11434/v1 as the baseUrl.
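
In that case only the remote block changes; everything else stays as in the example above:

```json
"remote": {
  "apiKey": "ollama",
  "baseUrl": "http://host.docker.internal:11434/v1"
}
```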


Resource Usage

| Model | Disk | RAM (loaded) | Purpose |
|---|---|---|---|
| nomic-embed-text | 274 MB | ~300 MB | Memory search embeddings |
| llama3.2:3b | 2.0 GB | ~2 GB | Local chat (optional) |

Ollama unloads models from RAM after 5 minutes of inactivity.
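
If you'd rather keep models resident (for example, to avoid the reload pause on the first search after idle), Ollama reads the OLLAMA_KEEP_ALIVE environment variable at server start. How you pass it depends on how the service is launched (e.g. a systemd override on Linux); a minimal sketch:

```shell
# Keep loaded models in RAM for 30 minutes instead of the 5-minute default.
# Must be set in the environment of the "ollama serve" process.
export OLLAMA_KEEP_ALIVE=30m
echo "$OLLAMA_KEEP_ALIVE"
```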

Alternative Embedding Models

| Model | Dimensions | Size | Notes |
|---|---|---|---|
| nomic-embed-text | 768 | 274 MB | Recommended — good balance |
| mxbai-embed-large | 1024 | 670 MB | Higher quality, more RAM |
| all-minilm | 384 | 46 MB | Smallest, fastest |

To switch models: run ollama pull <model>, then update "model" in your config.
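
For example, to switch to mxbai-embed-large (a sketch showing only the relevant fields):

```json
"memorySearch": {
  "provider": "openai",
  "model": "mxbai-embed-large",
  "remote": {
    "apiKey": "ollama",
    "baseUrl": "http://localhost:11434/v1"
  }
}
```

Note that changing the model triggers a memory index rebuild, since the embedding dimensions differ between models.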


Troubleshooting

Ollama not responding

# macOS
brew services restart ollama

# Linux
sudo systemctl restart ollama

Memory search returns empty after switching

The memory index rebuilds when the embedding provider/model changes. Give it a moment on first search, or restart Zirkabot.

MLX warning on macOS

WARN MLX dynamic library not available

Harmless — Ollama uses Metal instead. No action needed.

WSL2 can't reach Ollama on Windows

Use http://host.docker.internal:11434/v1 instead of localhost.
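
Also check that the Windows-side server listens on all interfaces — by default it binds to localhost only, which WSL2 cannot reach. Setting OLLAMA_HOST=0.0.0.0 in the Windows environment before starting Ollama fixes this. If host.docker.internal does not resolve in your distro, the Windows host is usually reachable at the default-route gateway; the address below is a hypothetical placeholder, not a real fixed IP:

```shell
# Find your actual gateway with: ip route show default | awk '{print $3}'
# 172.28.80.1 is an illustrative placeholder only.
WIN_HOST=172.28.80.1
echo "http://${WIN_HOST}:11434/v1"
```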