Saturday, March 21

If you’ve been paying $20–50/month for API calls to run a model that mostly does document summarisation or code completion, the Ollama self-hosted LLM setup will pay for itself in a week. Ollama wraps Llama, Mistral, Gemma, and a dozen other open-source models in a dead-simple interface that runs locally, costs nothing per inference, and exposes an OpenAI-compatible REST API you can drop into existing code with a one-line change. This guide covers installation on Windows, Mac, and Linux, model management, API configuration, and how to wire it into real applications — including n8n and Python agents.

Why Bother Self-Hosting at All?

The honest answer is: it depends on your workload. If you’re running GPT-4-level reasoning tasks, you’re not replacing that with a local 7B model. But a huge slice of real production workloads — classification, extraction, summarisation, first-pass code review, RAG retrieval — don’t need frontier model quality. They need speed, privacy, and zero marginal cost.

Ollama specifically solves the setup friction that made local models annoying before. You used to need CUDA drivers, Python environment hell, llama.cpp compilation, and a custom API shim. Ollama handles all of that and gives you a clean CLI and HTTP API in about five minutes. The tradeoff is that it’s designed for developer use, not production serving at scale — more on that later.

When Local Models Actually Make Sense

  • Privacy-sensitive data — medical records, legal documents, internal financials that can’t leave your infrastructure
  • High-volume, low-complexity tasks — if you’re running 10,000 classification calls a day, even cheap API pricing adds up fast
  • Offline or air-gapped environments — edge deployments, on-prem enterprise setups
  • Prototyping and experimentation — iterate on prompts without watching your API budget evaporate

Installing Ollama: Platform-by-Platform

macOS

The macOS install is the smoothest. Download the app from ollama.com, drag it to Applications, and launch it. Ollama runs as a menu bar app and starts a local server on http://localhost:11434 automatically. Metal GPU acceleration works out of the box on Apple Silicon — M-series chips handle 7B and 13B models comfortably, and higher-memory machines (48GB+ unified memory) can run quantised 70B models at usable speeds.

Alternatively, if you prefer CLI-only:

# macOS via Homebrew
brew install ollama
brew services start ollama   # run as a background service
# or: ollama serve           # start the server manually in the foreground

Linux

One-liner install that pulls and runs the official install script:

curl -fsSL https://ollama.com/install.sh | sh

This sets up a systemd service that starts on boot. NVIDIA GPU support is automatic if recent CUDA drivers are already installed. AMD GPU support via ROCm works but is less reliable — test it, don’t assume it. If you’re on CPU-only hardware, expect slow inference on anything larger than 7B; a 7B model on a modern CPU runs at roughly 5–8 tokens/second, which is usable for non-interactive workflows.

To check if Ollama picked up your GPU:

ollama run llama3.2  # start a model
# In another terminal:
ollama ps            # shows running models and whether GPU layers are loaded

Windows

Download the Windows installer from ollama.com. It installs a background service and adds ollama to your PATH. NVIDIA GPU acceleration works with the same CUDA requirement. WSL2 is supported if you prefer a Linux environment — the server runs in WSL and is accessible from Windows apps via localhost:11434.

One gotcha on Windows: the default model storage location is C:\Users\<you>\.ollama\models. These models are large (4–40GB each). Set the OLLAMA_MODELS environment variable to redirect storage to a drive with more space before you start pulling models.

# Windows PowerShell — set model path for the current session
$env:OLLAMA_MODELS = "D:\ollama-models"
ollama serve

# Persist it across sessions (restart Ollama afterwards):
[Environment]::SetEnvironmentVariable("OLLAMA_MODELS", "D:\ollama-models", "User")

Pulling and Running Models

Ollama has its own model library at ollama.com/library. The naming convention is model:tag where tag specifies the parameter count and quantisation. Here are the models I’d actually recommend starting with:

  • llama3.2:3b — 2GB download, runs fast on CPU, good for simple tasks and prototyping
  • llama3.1:8b — 4.7GB, solid general-purpose model, runs well on 8GB RAM machines
  • mistral:7b — 4.1GB, slightly better at instruction-following than Llama for some tasks
  • codellama:13b — 7.4GB, specifically fine-tuned for code, better than base Llama for code completion
  • deepseek-coder-v2:16b — strong coding model, needs ~16GB RAM/VRAM
  • gemma2:27b — Google’s model, impressive reasoning at this size, needs ~20GB VRAM for full GPU offload

# Pull a model (downloads to local storage)
ollama pull llama3.1:8b

# Run interactively in terminal
ollama run llama3.1:8b

# Run with a specific system prompt
ollama run llama3.1:8b "Summarise the following text in three bullet points: ..."

# List downloaded models
ollama list

# Remove a model to free disk space
ollama rm codellama:13b

Quantisation matters here. The default tag usually gives you Q4_K_M quantisation, which is a good balance of size and quality. If you need more accuracy and have VRAM to spare, pull the :q8_0 variant. If you’re RAM-constrained, try :q4_0 but expect some quality degradation on complex reasoning.
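As a rough back-of-envelope for choosing a quantisation level: download size is roughly parameters × bits-per-weight ÷ 8. A sketch (the effective bits-per-weight figures are approximations, and runtime memory needs another 10–20% on top for KV cache and buffers):

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate download size of a quantised model in GB."""
    return round(params_billion * bits_per_weight / 8, 1)

# Q4_K_M lands around 4.5-5 effective bits per weight; q8_0 around 8.5
print(approx_size_gb(8, 4.7))   # → 4.7, close to the llama3.1:8b default download
print(approx_size_gb(8, 8.5))   # → 8.5
```

This is why the q8_0 variant of an 8B model is nearly twice the download and memory footprint of the default tag.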

The HTTP API: Where It Gets Useful

Ollama’s REST API is what makes it drop-in compatible with a huge range of tooling. The native API is at /api/generate and /api/chat, but the OpenAI-compatible endpoint at /v1/chat/completions is what you’ll use in practice.

from openai import OpenAI

# Point the OpenAI client at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the client but not validated by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Be concise."},
        {"role": "user", "content": "Explain the difference between RAG and fine-tuning in two sentences."}
    ],
    temperature=0.7
)

print(response.choices[0].message.content)

This means any code you’ve written against the OpenAI SDK works against Ollama with two changes: base_url and model. That includes LangChain, LlamaIndex, and most agent frameworks — they all support custom base URLs.
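If you do need the native /api/chat endpoint — it exposes Ollama-specific options such as keep_alive — the payload shape differs slightly from the OpenAI one. A minimal sketch using only the standard library (the prompt and keep_alive value are illustrative):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Build a payload for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # stream=False returns one JSON object instead of newline-delimited chunks
        "stream": False,
        # native-only option: how long to keep the model loaded after this call
        "keep_alive": "10m",
    }

payload = build_chat_request("llama3.1:8b", "Summarise this in one line.")
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With a local server running:
# body = json.loads(urllib.request.urlopen(req, timeout=120).read())
# print(body["message"]["content"])
```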

Streaming Responses

# Streaming works the same way
stream = client.chat.completions.create(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV files"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Making Ollama Accessible Over a Network

By default, Ollama only listens on 127.0.0.1. To expose it to other machines on your network (or to a Docker container), set the bind address:

# Linux: set via environment variable in the systemd service
# Edit /etc/systemd/system/ollama.service and add:
# Environment="OLLAMA_HOST=0.0.0.0:11434"

# Or for a quick test without modifying the service:
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# macOS: set before launching from a terminal (for the menu bar app,
# use `launchctl setenv OLLAMA_HOST 0.0.0.0:11434` and restart Ollama)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Do not expose this to the public internet without authentication. Ollama has no built-in auth. Put it behind a reverse proxy (Nginx, Caddy) with basic auth or mTLS if you need external access. For internal LAN use only, binding to 0.0.0.0 is fine.
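For the reverse-proxy route, a minimal Caddyfile sketch might look like this — the hostname is a placeholder, and you'd generate the password hash with `caddy hash-password`:

```
ollama.internal.example.com {
    basicauth {
        admin <bcrypt-hash-from-caddy-hash-password>
    }
    reverse_proxy localhost:11434
}
```

Caddy handles TLS automatically for the hostname, which is one less thing to configure than the equivalent Nginx setup.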

Integrating Ollama with n8n and Automation Workflows

n8n has a built-in Ollama node as of recent versions, but the OpenAI-compatible API approach is more flexible. In n8n, add an HTTP Request node and point it at your Ollama instance:

  • Method: POST
  • URL: http://<your-ollama-host>:11434/v1/chat/completions
  • Headers: Content-Type: application/json
  • Body: JSON with model, messages, and any other params

This approach works with Make (Integromat) too — anywhere you can make an HTTP request, you can call Ollama. The limitation is that n8n needs network access to your Ollama host, so if you’re running n8n in the cloud and Ollama locally, you’ll need a tunnel (ngrok, Cloudflare Tunnel) or just run both in the same Docker network.
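For reference, the JSON body for that HTTP Request node follows the standard chat-completions shape — here with an n8n expression pulling the input text from the previous node (the `text` field name is an assumption; match it to your own data):

```json
{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral. Reply with one word."},
    {"role": "user", "content": "{{ $json.text }}"}
  ],
  "temperature": 0
}
```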

Docker Compose Setup for n8n + Ollama

version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # For NVIDIA GPU support, add:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - capabilities: [gpu]

  n8n:
    image: n8nio/n8n
    ports:
      - "5678:5678"
    environment:
      - N8N_BASIC_AUTH_ACTIVE=true
      - N8N_BASIC_AUTH_USER=admin
      - N8N_BASIC_AUTH_PASSWORD=your_password
    depends_on:
      - ollama

volumes:
  ollama_data:

With this setup, n8n can reach Ollama at http://ollama:11434 using the Docker service name. Pull your model after the containers start: docker exec -it <ollama_container> ollama pull llama3.1:8b

Real Limitations You’ll Hit in Production

Ollama is genuinely good for what it is, but there are real rough edges:

  • Single-request concurrency by default. Ollama queues requests rather than batching them, so under concurrent load requests wait. Recent versions can serve a few requests in parallel via OLLAMA_NUM_PARALLEL, but for genuinely high-throughput serving, look at vLLM or llama.cpp with proper server configuration instead.
  • Model load time. Cold-starting a 13B model takes 5–15 seconds. Ollama keeps models in memory for a configurable period (default 5 minutes) after the last request, so this only hits you on the first call after idle. Set OLLAMA_KEEP_ALIVE to a long duration (e.g. 24h), or to -1 to keep models loaded indefinitely.
  • Context window limits vary by quantisation. A model’s advertised context window might not be achievable at the quantisation level you’re using without running out of RAM. Test with your actual document sizes.
  • No built-in observability. There’s no dashboard, no request logging, no token counting in the response (though the API does return usage stats). Wire in something external if you care about monitoring.
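On that last point, the OpenAI-compatible responses do carry usage counts, so a thin logging wrapper goes a long way. A minimal sketch — the wrapper itself is hypothetical, but the field names follow the OpenAI response schema that Ollama mirrors:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ollama-usage")

def log_usage(response, started_at: float) -> dict:
    """Extract token counts from an OpenAI-style response object and log them."""
    usage = {
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "latency_s": round(time.monotonic() - started_at, 2),
    }
    log.info("model=%s usage=%s", response.model, usage)
    return usage

# Usage: t0 = time.monotonic(); resp = client.chat.completions.create(...)
#        log_usage(resp, t0)
```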

Which Model to Actually Run for Your Use Case

I’d use llama3.1:8b as a starting point for most general tasks — it punches above its weight, runs on hardware most developers already have, and the instruction-following is reliable enough for production automation. For code specifically, deepseek-coder-v2:16b is noticeably better if your machine can handle it. For anything requiring strong reasoning or nuanced writing, be honest with yourself: a local 7B model won’t replace Claude Haiku (which costs roughly $0.25 per million input tokens), but it doesn’t need to for the right workloads.

The practical sweet spot for an Ollama self-hosted LLM setup is handling your high-volume, lower-stakes inference locally while routing complex or high-stakes requests to a frontier API. That hybrid approach gives you cost control without sacrificing quality where it actually matters.
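That routing decision can be as simple as a lookup keyed on task type. A sketch with illustrative task labels and a placeholder frontier endpoint — not from any real framework:

```python
# Route cheap, high-volume task types to local Ollama; escalate the rest
# to a frontier API. Task labels and the frontier URL are placeholders.
LOCAL_TASKS = {"classify", "extract", "summarise"}

def pick_backend(task: str) -> tuple[str, str]:
    """Return (base_url, model) for a given task type."""
    if task in LOCAL_TASKS:
        return ("http://localhost:11434/v1", "llama3.1:8b")
    return ("https://api.frontier.example/v1", "frontier-model")

print(pick_backend("classify"))
print(pick_backend("draft-legal-brief"))
```

Because both backends speak the same chat-completions dialect, the rest of your code doesn't need to know which one handled the request.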

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
