Tested on: Ubuntu 24.04 LTS, Ollama 0.30.9 (June 2026 stable line), NVIDIA driver 555-series with CUDA 12.x, Caddy 2.8 as the reverse proxy. The Ollama binary moves fast; verify the systemd unit and environment-variable names against the release you are installing.

This guide assumes you have already worked through ubuntu-baseline or rhel-baseline and the ssh-hardening guide. Ollama is layered on top of a defensible host, not a fresh provider image. If the underlying box is not hardened, hardening the model server on top of it is theatre.

Scope

Applies to self-hosted Ollama installed via the official install.sh script on a Linux host that is or will be exposed to a private network or to authenticated internet traffic. Covers the bind / authentication / resource-control surface that Ollama exposes, not the application-layer concerns (prompt injection, RAG data flows) which belong to agentic-ai-pipeline-reference.

Compliance references — where they apply — sit on the section headings; per-item NIS2 and ISO 27001 mappings are not padded.

What Ollama actually is, and where the attack surface lives

Ollama is a wrapper around llama.cpp (and a small number of other inference backends). It exposes:

  • A CLI (ollama run, ollama pull, ollama list)
  • An HTTP API on TCP 11434
  • A local model store at /usr/share/ollama/.ollama/models
  • A pull-from-registry workflow (default: registry.ollama.ai)

The official installer creates a system user ollama:ollama, drops the binary at /usr/bin/ollama, and writes a minimal systemd unit at /etc/systemd/system/ollama.service. The default posture is single-user, single-host, localhost only — fine for a laptop, not defensible for anything else.

The surfaces worth hardening, in order of blast radius:

  1. Network bind. The OLLAMA_HOST environment variable decides what the HTTP API listens on. Misconfigured, this is the entire game — an unauthenticated /api/generate, /api/pull, /api/delete exposed to the public internet.
  2. Reverse-proxy posture. Even on the right interface, the API has no built-in authentication. Anything reaching port 11434 can pull models, run inference, and delete the model store. Auth lives at the proxy.
  3. systemd unit. The default unit has none of the sandboxing flags systemd provides — process namespaces, kernel protection, capability bounding, syscall filtering. All free, all needed.
  4. GPU access. The ollama user is added to render and video so it can talk to /dev/nvidia*. Anyone else with the same group membership inherits the same access.
  5. Model storage. Models are large, slow to re-pull, and represent operational investment. The default permissions are loose.
  6. Update mechanism. The official install script downloads and runs an unpinned binary. Fine for a homelab; not for production change management.

1. Bind safety (NIS2: 21(2)(e); ISO 27001: A.8.20)

Default is correct: Ollama binds 127.0.0.1:11434. The dangerous config is OLLAMA_HOST=0.0.0.0:11434, which the installer never sets but many tutorials suggest. Verify before changing anything:

ss -tlnp | grep 11434

Expect 127.0.0.1:11434. If it shows 0.0.0.0:11434 or [::]:11434, the API is open on every interface — the host firewall is the only thing between the model server and the internet.

If Ollama must accept traffic from another machine on a private network (a separate web tier, a wireguard peer), bind to that specific interface:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=10.0.0.5:11434"

Never 0.0.0.0. Bind to the exact address you need. Reload and verify:

systemctl daemon-reload
systemctl restart ollama
ss -tlnp | grep 11434
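
The eyeball check above can also be made scriptable. The following is a minimal sketch, not part of Ollama or iproute2: check_bind is a hypothetical helper name, and it assumes the usual ss -tln column layout, where the state is the first field and the local address:port is the fourth.

```shell
# Classify the listeners on a port from `ss -tln` output read on stdin.
# Prints "wildcard <addr>" or "specific <addr>" for each listener and
# returns non-zero if any listener is bound to every interface.
# Assumption: field 1 is the socket state, field 4 is local address:port.
check_bind() {
    port="${1:-11434}"
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            addr = $4
            sub(":" port "$", "", addr)   # strip the trailing :port
            if (addr == "0.0.0.0" || addr == "[::]" || addr == "*") {
                print "wildcard " addr
                bad = 1
            } else {
                print "specific " addr
            }
        }
        END { exit bad + 0 }
    '
}
```

Usage: ss -tln | check_bind 11434. A non-zero exit status means at least one listener on the port is a wildcard bind, which makes the function usable as a gate in a deploy script.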

2. systemd unit hardening (NIS2: 21(2)(e); ISO 27001: A.8.25)

The installer’s default unit looks roughly like this:

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

That is the floor. The hardened drop-in:

# /etc/systemd/system/ollama.service.d/override.conf
# Comments sit on their own lines: systemd does not parse trailing comments,
# and a value followed by "# ..." fails to parse.
[Service]
# Process isolation
NoNewPrivileges=true
PrivateTmp=true
# Set PrivateDevices=true if NOT using a GPU
PrivateDevices=false
ProtectSystem=strict
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
RestrictNamespaces=true
RestrictSUIDSGID=true
RestrictRealtime=true
LockPersonality=true
# llama.cpp uses JIT in some paths
MemoryDenyWriteExecute=false
# Filesystem
ReadWritePaths=/usr/share/ollama
# Capabilities
AmbientCapabilities=
CapabilityBoundingSet=
# Syscall filter
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM
# Resource controls (size to your largest model + headroom)
MemoryMax=48G
TasksMax=4096

Two notes that bite people:

  • PrivateDevices=true breaks GPU access. If you set it, CUDA cannot find /dev/nvidia*. Leave false on GPU hosts; set true on CPU-only hosts.
  • MemoryDenyWriteExecute=true breaks the JIT. Some llama.cpp inference paths require W+X memory. Leave false.

Apply and verify:

systemctl daemon-reload
systemctl restart ollama
systemctl show ollama -p NoNewPrivileges,ProtectSystem,PrivateTmp,MemoryMax

Expect each property to match the value in your override; note that systemctl show prints booleans as yes/no and MemoryMax in bytes. The systemd-analyze security ollama score is a useful before-and-after metric: the default unit scores around 9.6 (“UNSAFE”); the hardened unit drops to around 3.0 (which systemd-analyze labels “OK”) with the GPU caveats above. CPU-only hosts can get below 2.0.
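
To turn that comparison into a pass/fail check, something like the sketch below may help. check_props is a hypothetical helper, not a systemd tool; it only compares Key=Value lines and assumes you pass the expected values in the form systemctl show prints them (yes/no for booleans, bytes for MemoryMax).

```shell
# Compare `systemctl show -p ...` output (Key=Value lines on stdin) with
# the expected Key=Value pairs passed as arguments.
# Prints one line per mismatch and returns non-zero if any property differs.
check_props() {
    actual=$(cat)
    rc=0
    for want in "$@"; do
        key=${want%%=*}
        # First line for this key, or empty if the property is absent
        got=$(printf '%s\n' "$actual" | grep "^${key}=" | head -n 1)
        if [ "$got" != "$want" ]; then
            echo "MISMATCH ${key}: want '${want#*=}', got '${got#*=}'"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage, with 48G written out as 51539607552 bytes: systemctl show ollama -p NoNewPrivileges,ProtectSystem,PrivateTmp,MemoryMax | check_props NoNewPrivileges=yes ProtectSystem=strict PrivateTmp=yes MemoryMax=51539607552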

3. Reverse-proxy posture: TLS + auth + rate limits (NIS2: 21(2)(e),(h),(j); ISO 27001: A.5.14, A.8.5, A.8.24)

Ollama’s API has no authentication. The default ollama serve accepts any caller on the bind interface and answers any prompt, including /api/pull (download an arbitrary model) and /api/delete (remove a local model). Authentication and TLS live at the reverse proxy.

A minimal Caddy front-end on a separate hostname:

ai.internal.example.com {
    encode gzip

    # Block administrative endpoints entirely from outside
    @admin path /api/pull* /api/push* /api/delete* /api/create*
    handle @admin {
        respond 404
    }

    # Require basic auth on the rest
    basic_auth /api/* {
        ops $2a$14$REDACTED_BCRYPT_HASH
    }

    reverse_proxy 127.0.0.1:11434 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
    }
}

What that does:

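  • pull, push, delete, create endpoints are never proxied: a caller with valid credentials still gets 404. Caddy's default directive order evaluates basic_auth before handle, so a caller without credentials gets 401 first. Those operations stay on the host, via the ollama CLI run as the operator.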
  • The remaining endpoints (/api/generate, /api/chat, /api/embeddings, /api/tags) require HTTP basic auth over TLS. Caddy handles certificate provisioning automatically via Let’s Encrypt.
  • The hop from Caddy to the model server itself stays plaintext HTTP on 127.0.0.1; the loopback hop is the trust boundary.

Generate the bcrypt hash with caddy hash-password. Rotate by issuing a new hash and editing the basic_auth block; reload Caddy. For multi-user production, replace basic auth with a forward-auth handler against an identity provider or with a LiteLLM proxy that does per-key accounting — covered in the planned litellm-gateway-hardening piece, not here.

Rate limiting is plugin territory in Caddy; the simplest production-acceptable answer is to put Ollama behind an Nginx front-end with limit_req instead (see nginx-ratelimit for the patterns).
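
Whichever front-end you settle on, the proxy behaviour described in this section is easiest to confirm from the status line of each response. A minimal sketch follows; http_status and expect_status are hypothetical helper names, and the only assumption is that the response head (for example from curl -sI) arrives on stdin with the status line first.

```shell
# Extract the numeric status code from an HTTP response head on stdin.
# Handles both "HTTP/1.1 401 Unauthorized" and "HTTP/2 401".
http_status() {
    tr -d '\r' | awk 'NR == 1 { print $2; exit }'
}

# expect_status <label> <expected-code>
# Compares the expected code with the response head on stdin.
expect_status() {
    got=$(http_status)
    if [ "$got" = "$2" ]; then
        echo "ok   $1 ($got)"
    else
        echo "FAIL $1: expected $2, got ${got:-nothing}"
        return 1
    fi
}
```

Usage: curl -sI https://ai.internal.example.com/api/tags | expect_status "tags without credentials" 401. The function returns non-zero on a mismatch, so several checks can be chained with && in a deploy script.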

4. GPU access scoping (ISO 27001: A.8.2, A.8.5)

The installer adds the ollama user to render and video, which grants read/write access to /dev/nvidia0, /dev/nvidiactl, and /dev/nvidia-uvm. Anyone else in those groups inherits the same access.

Audit it:

getent group video render | awk -F: '{print $1": "$4}'

The output should list service accounts only. If a human user is in video or render, they can talk to the GPU directly, bypassing Ollama and any rate limits at the proxy. For a host dedicated to model serving, remove human accounts from those groups:

gpasswd -d alice video
gpasswd -d alice render
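
On a host with more than a handful of accounts, the audit is easier to repeat against an allowlist. The sketch below uses a hypothetical helper, unexpected_members, which reads getent group lines (name:password:gid:members) on stdin and reports every member that is not one of the service accounts passed as arguments.

```shell
# Read `getent group` lines (name:x:gid:member1,member2) on stdin and print
# "group:member" for every member that is not in the allowlist ("$@").
unexpected_members() {
    allow=" $* "
    awk -F: -v allow="$allow" '
        $4 != "" {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++)
                if (index(allow, " " members[i] " ") == 0)
                    print $1 ":" members[i]
        }
    '
}
```

Usage: getent group video render | unexpected_members ollama. Empty output means only allowlisted accounts are listed. The member field does not include users whose primary group is video or render, so check /etc/passwd for those GIDs separately.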

For the systemd unit, the device allowances belong with the hardening drop-in:

# Append to /etc/systemd/system/ollama.service.d/override.conf
DevicePolicy=closed
DeviceAllow=/dev/nvidia0 rw
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw
DeviceAllow=/dev/nvidia-uvm-tools rw

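DevicePolicy=closed denies every device node except the standard pseudo-devices (/dev/null, /dev/zero, /dev/full, /dev/random, /dev/urandom); the DeviceAllow lines open only what the GPU needs. With more than one GPU, add a DeviceAllow line for each additional /dev/nvidiaN node.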
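
For multi-GPU hosts the extra lines can be generated rather than typed. This is a sketch only: device_allow_lines is a hypothetical helper that prints one DeviceAllow line per GPU node passed as an argument, followed by the three shared nodes already listed in the drop-in. It does not check that the nodes exist.

```shell
# Print DeviceAllow= lines for the given per-GPU device nodes, followed by
# the shared control nodes (nvidiactl, nvidia-uvm, nvidia-uvm-tools).
# The caller supplies the per-GPU nodes; nothing is probed or validated here.
device_allow_lines() {
    for dev in "$@" /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
        printf 'DeviceAllow=%s rw\n' "$dev"
    done
}
```

Usage: device_allow_lines /dev/nvidia[0-9]* prints one line per GPU node present on the host. Review the output before pasting it under [Service] in the drop-in; if the glob matches nothing, the shell passes the pattern through literally.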

5. Model storage permissions and registry trust (GDPR: Article 32; ISO 27001: A.8.10, A.8.24)

Default location: /usr/share/ollama/.ollama/models. Owned by ollama:ollama but the directory permissions are 755 — anyone on the box can list the model store and read the blobs. Tighten:

chown -R ollama:ollama /usr/share/ollama
chmod 700 /usr/share/ollama
chmod 700 /usr/share/ollama/.ollama
find /usr/share/ollama -type d -exec chmod 700 {} +
find /usr/share/ollama -type f -exec chmod 600 {} +

For a host that handles confidential prompts or runs fine-tuned models that represent operator IP, mount the model store on a LUKS-encrypted partition. The patterns from encrypted-backups-restic apply — encryption at rest is cheap insurance against a stolen disk.
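
Permissions drift as new models are pulled, so it can be worth re-checking the store after each pull. A minimal sketch, assuming GNU find (the -perm /MODE form); loose_entries is a hypothetical helper name.

```shell
# List every file or directory under $1 that still grants any permission
# to group or other. Symlinks are skipped because their own mode bits are
# not meaningful. Assumes GNU find for the -perm /077 "any of these bits" test.
loose_entries() {
    find "$1" ! -type l -perm /077 -print
}
```

Usage: loose_entries /usr/share/ollama, run as root or as the ollama user so the 700 directories can be traversed. Empty output means nothing under the store is readable by other accounts.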

Registry trust. Ollama pulls models from registry.ollama.ai by default and verifies the SHA-256 of every blob against the manifest before saving. That handles integrity-in-transit. It does not verify the model contents are safe — a publisher account that gets compromised can ship a poisoned model that downloads and runs cleanly. Two mitigations:

  • Pin models by digest (ollama pull llama3:8b@sha256:abc...) in any automation, so a re-pull after a registry incident does not silently update.
  • Mirror trusted models to an internal registry (set OLLAMA_REGISTRY to your mirror’s URL). Same posture as pinning a pip or npm mirror.
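
The integrity check Ollama performs at pull time can be repeated later against the files on disk, for example after restoring a backup. The sketch below rests on an assumption you should confirm on your release: that the store keeps each blob as blobs/sha256-<hex digest>, so the filename carries the expected hash. verify_blobs is a hypothetical helper, not an Ollama command, and like the pull-time check it says nothing about whether the model contents are safe.

```shell
# Recompute the SHA-256 of every blob in directory $1 and compare it with
# the digest embedded in the filename.
# Assumption: blobs are stored as <dir>/sha256-<hex digest>.
verify_blobs() {
    rc=0
    for blob in "$1"/sha256-*; do
        [ -f "$blob" ] || continue          # also skips an unmatched glob
        want=${blob##*/sha256-}
        got=$(sha256sum "$blob" | awk '{ print $1 }')
        if [ "$got" = "$want" ]; then
            echo "ok       ${blob##*/}"
        else
            echo "MISMATCH ${blob##*/}"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: verify_blobs /usr/share/ollama/.ollama/models/blobs. Large models take a while to hash, so this suits a scheduled job better than an interactive check.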

6. Logging and audit (NIS2: 21(2)(b); ISO 27001: A.8.15, A.8.16)

Ollama logs go to journald via the systemd unit:

journalctl -u ollama --since "1 hour ago"

Two log streams worth thinking about:

  • Service lifecycle. Start, stop, errors, model loads. Forward to the central log target along with the rest of the host’s syslog stream — same pipeline as the log-minimisation-recipes guide describes.
  • Request stream. Every /api/generate and /api/chat request carries the prompt in its body. Ollama itself does not log prompts by default, and a standard reverse-proxy access log records only URL, method, and status, so prompts are persisted only once you turn on request-body logging. Decide before you turn anything on whether prompts may legally be persisted: they often contain personal data or confidential business information. If they may not, configure the proxy to log without bodies and to drop the access log line on the /api/generate and /api/chat paths, or to hash the request body before logging.

The privacy framing here is the same as the privacy-by-design-server-build guide: decide retention and redaction policy before you ship, not after.
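
As an illustration of the "hash the request body before logging" option, the sketch below reduces a body to a SHA-256 fingerprint. body_fingerprint is a hypothetical helper; how you hook it into a proxy's logging pipeline depends on the proxy and is not shown.

```shell
# Reduce a request body (read on stdin) to a fingerprint for a log line.
# The body itself is never written anywhere; only its SHA-256 is printed.
body_fingerprint() {
    sha256sum | awk '{ print "sha256=" $1 }'
}
```

Usage: printf '%s' "$body" | body_fingerprint. An unsalted hash lets you correlate identical requests, but a short or guessable prompt can still be recovered by hashing candidates, so treat the fingerprint as pseudonymised data, not anonymous data.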

7. Updates and maintenance (NIS2: 21(2)(e); ISO 27001: A.8.8)

The official update path is to re-run the install script:

curl -fsSL https://ollama.com/install.sh | sh

That is fine for a homelab. For production, pin the version and verify:

# Pin to a specific release
OLLAMA_VERSION=0.30.9
BASE="https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}"

# Keep the upstream filename locally: sha256sum -c looks up the file by the
# name recorded in the checksum file. Confirm the asset names on the release page.
curl -fsSL -o /tmp/ollama-linux-amd64 "${BASE}/ollama-linux-amd64"
curl -fsSL -o /tmp/ollama-linux-amd64.sha256 "${BASE}/ollama-linux-amd64.sha256"

# Verify the checksum before installing
( cd /tmp && sha256sum -c ollama-linux-amd64.sha256 ) || { echo "checksum mismatch" >&2; exit 1; }

# Replace the binary
systemctl stop ollama
install -o root -g root -m 0755 /tmp/ollama-linux-amd64 /usr/bin/ollama
systemctl start ollama

Wrap that in a small script that records the version and date in the host’s change log. The same unattended-upgrades cadence that handles OS packages does not handle Ollama — a manual update cycle, scheduled, is the right answer.

Before updating: snapshot the host if you can, and run ollama list and ollama show <model> to record the models you have, so a roll forward (or back) is reproducible.
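
The change-log step of that wrapper script can be as small as the sketch below. record_update is a hypothetical helper, and the log path in the usage line is a placeholder for wherever your host keeps its change log.

```shell
# Append one line per update to a change log:
# UTC timestamp, previous version, new version.
#   $1 = log file, $2 = previous version, $3 = new version
record_update() {
    printf '%s ollama %s -> %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" >> "$1"
}
```

Usage: record_update /var/log/ollama-changes.log 0.30.8 "$OLLAMA_VERSION", called after the new binary has started cleanly. Saving the output of ollama list next to the log at the same time gives you the model inventory for that date.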

Verification — the smoke-test sequence

After every change to the unit, the proxy config, or an update, run the same checks:

# Service active and bound to the intended interface only
systemctl is-active ollama
ss -tlnp | grep 11434

# Hardening flags actually applied
systemctl show ollama -p \
    NoNewPrivileges,ProtectSystem,ProtectHome,PrivateTmp,MemoryMax

# Security score (before-and-after metric); the overall exposure level
# is the last line of the report
systemd-analyze security ollama | tail -n 1

# Model store permissions tight
stat -c '%U:%G %a' /usr/share/ollama /usr/share/ollama/.ollama

# GPU device controls in effect (skip on CPU-only hosts)
systemctl show ollama -p DevicePolicy,DeviceAllow

# Reverse proxy returns 401 without credentials
curl -sI https://ai.internal.example.com/api/tags | head -1

# Reverse proxy blocks admin endpoints with 404
curl -sI -u ops:pass https://ai.internal.example.com/api/pull | head -1

# Authenticated read endpoint works
curl -s -u ops:pass https://ai.internal.example.com/api/tags \
    | jq '.models | length'

Three minutes end-to-end. They catch every regression I have hit on this stack so far.

What this guide deliberately does not cover

  • LiteLLM, OpenAI-compatible gateways, per-key quotas. That is the planned litellm-gateway-hardening piece — Ollama is the model serving layer, LiteLLM is the policy and accounting layer in front of it. They harden separately.
  • Prompt-injection defences and RAG-layer data security. Covered by agentic-ai-pipeline-reference.
  • Vector databases. Planned qdrant-hardening and pgvector-on-postgres pieces.
  • Multi-node serving / load balancing. Ollama is single-node by design; the planned vllm-production-baseline piece covers the multi-node case with vLLM, where the operational shape is meaningfully different.
  • AMD ROCm and Intel Gaudi GPU stacks. NVIDIA-only here. The systemd device allowances follow the same pattern but the device nodes differ.
  • Fine-tuning workflows and model hosting. Out of scope.