Tested on: Ubuntu 24.04 LTS, Ollama 0.30.9 (June 2026 stable line), NVIDIA driver 555-series with CUDA 12.x, Caddy 2.8 as the reverse proxy. The Ollama binary moves fast; verify the systemd unit and environment-variable names against the release you are installing.
This guide assumes you have already worked through
ubuntu-baseline or
rhel-baseline and the
ssh-hardening guide. Ollama is layered
on top of a defensible host, not a fresh provider image. If the
underlying box is not hardened, hardening the model server on top
of it is theatre.
Scope
Applies to self-hosted Ollama installed via the official
install.sh script on a Linux host that is or will be exposed
to a private network or to authenticated internet traffic. Covers
the bind / authentication / resource-control surface that Ollama
exposes, not the application-layer concerns (prompt injection,
RAG data flows) which belong to
agentic-ai-pipeline-reference.
Compliance references — where they apply — sit on the section headings; per-item NIS2 and ISO 27001 mappings are not padded.
What Ollama actually is, and where the attack surface lives
Ollama is a wrapper around llama.cpp (and a small number of
other inference backends). It exposes:
- A CLI (ollama run, ollama pull, ollama list)
- An HTTP API on TCP 11434
- A local model store at /usr/share/ollama/.ollama/models
- A pull-from-registry workflow (default: registry.ollama.ai)
The official installer creates a system user ollama:ollama,
drops the binary at /usr/bin/ollama, and writes a minimal
systemd unit at /etc/systemd/system/ollama.service. The default
posture is single-user, single-host, localhost only — fine for
a laptop, not defensible for anything else.
The surfaces worth hardening, in order of blast radius:
- Network bind. The OLLAMA_HOST environment variable decides what the HTTP API listens on. Misconfigured, this is the entire game: an unauthenticated /api/generate, /api/pull, /api/delete exposed to the public internet.
- Reverse-proxy posture. Even on the right interface, the API has no built-in authentication. Anything reaching port 11434 can pull models, run inference, and delete the model store. Auth lives at the proxy.
- systemd unit. The default unit has none of the sandboxing flags systemd provides: process namespaces, kernel protection, capability bounding, syscall filtering. All free, all needed.
- GPU access. The ollama user is added to render and video so it can talk to /dev/nvidia*. Anyone else with the same group membership inherits the same access.
- Model storage. Models are large, slow to re-pull, and represent operational investment. The default permissions are loose.
- Update mechanism. The official install script downloads and runs an unpinned binary. Fine for a homelab; not for production change management.
1. Bind safety (NIS2: 21(2)(e); ISO 27001: A.8.20)
Default is correct: Ollama binds 127.0.0.1:11434. The dangerous
config is OLLAMA_HOST=0.0.0.0:11434, which the installer never
sets but many tutorials suggest. Verify before changing anything:
ss -tlnp | grep 11434
Expect 127.0.0.1:11434. If it shows 0.0.0.0:11434 or [::]:11434,
the API is open on every interface — the host firewall is the
only thing between the model server and the internet.
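The check above is easy to misread when several listeners are present. As a minimal sketch, assuming the usual ss -tln column layout (state first, local address fourth), the helper below classifies whatever is listening on the port. The function name classify_listeners is hypothetical.

```shell
# Classify listeners on a port from `ss -tln` output read on stdin.
# Prints "ok <addr>" for loopback, "EXPOSED <addr>" for wildcard binds and
# "review <addr>" for any other address. Returns non-zero on a wildcard bind.
classify_listeners() {
    local port="${1:-11434}"
    awk -v port="$port" '
        $1 == "LISTEN" {
            addr = $4                          # Local Address:Port column
            n = split(addr, parts, ":")
            if (parts[n] != port) next         # listener on another port
            host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
            if (host == "127.0.0.1" || host == "[::1]") {
                print "ok " addr
            } else if (host == "0.0.0.0" || host == "[::]" || host == "*") {
                print "EXPOSED " addr
                bad = 1
            } else {
                print "review " addr
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Typical use would be: ss -tln | classify_listeners 11434. A "review" line is expected when you have deliberately bound to a private address; an "EXPOSED" line never is.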
If Ollama must accept traffic from another machine on a private network (a separate web tier, a wireguard peer), bind to that specific interface:
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=10.0.0.5:11434"
Never 0.0.0.0. Bind to the exact address you need. Reload and
verify:
systemctl daemon-reload
systemctl restart ollama
ss -tlnp | grep 11434
2. systemd unit hardening (NIS2: 21(2)(e); ISO 27001: A.8.25)
The installer’s default unit looks roughly like this:
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
That is the floor. The hardened drop-in:
# /etc/systemd/system/ollama.service.d/override.conf
# (systemd does not support trailing comments, so notes sit on their own lines)
[Service]
# Process isolation
NoNewPrivileges=true
PrivateTmp=true
# Set PrivateDevices=true if NOT using a GPU
PrivateDevices=false
ProtectSystem=strict
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectControlGroups=true
ProtectClock=true
RestrictNamespaces=true
RestrictSUIDSGID=true
RestrictRealtime=true
LockPersonality=true
# llama.cpp uses JIT in some paths, so W+X memory stays allowed
MemoryDenyWriteExecute=false
# Filesystem
ReadWritePaths=/usr/share/ollama
# Capabilities
AmbientCapabilities=
CapabilityBoundingSet=
# Syscall filter
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM
# Resource controls (size to your largest model + headroom)
MemoryMax=48G
TasksMax=4096
Two notes that bite people:
- PrivateDevices=true breaks GPU access. If you set it, CUDA cannot find /dev/nvidia*. Leave false on GPU hosts; set true on CPU-only hosts.
- MemoryDenyWriteExecute=true breaks the JIT. Some llama.cpp inference paths require W+X memory. Leave false.
Apply and verify:
systemctl daemon-reload
systemctl restart ollama
systemctl show ollama -p NoNewPrivileges,ProtectSystem,PrivateTmp,MemoryMax
Expect each property to reflect your override. systemctl show
prints normalised values, so true appears as yes and MemoryMax=48G
appears in bytes (51539607552). The systemd-analyze security ollama
score is a useful before-and-after metric: the default unit scores
around 9.6 (“unsafe”); the hardened unit drops to around 3.0
(“medium”) with the GPU caveats above. CPU-only hosts can get
below 2.0.
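Eyeballing property output gets unreliable once the drop-in grows. The sketch below compares systemctl show output against the values you expect and prints only the mismatches. The function name check_unit_props is hypothetical, and the expected values must be written the way systemctl show prints them.

```shell
# Compare KEY=VALUE lines on stdin (as printed by `systemctl show -p ...`)
# against expected KEY=VALUE pairs passed as arguments.
# Prints one line per mismatch and returns non-zero if any were found.
check_unit_props() {
    local actual expected key want got rc=0
    actual="$(cat)"
    for expected in "$@"; do
        key="${expected%%=*}"      # text before the first '='
        want="${expected#*=}"      # text after the first '='
        got="$(printf '%s\n' "$actual" | sed -n "s/^${key}=//p" | head -n 1)"
        if [ "$got" != "$want" ]; then
            echo "MISMATCH ${key}: want '${want}', got '${got}'"
            rc=1
        fi
    done
    return "$rc"
}
```

For the drop-in above, one way to call it: systemctl show ollama -p NoNewPrivileges,ProtectSystem,PrivateTmp,MemoryMax | check_unit_props NoNewPrivileges=yes ProtectSystem=strict PrivateTmp=yes MemoryMax=51539607552. No output and a zero exit status mean the override took effect.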
3. Reverse-proxy posture: TLS + auth + rate limits (NIS2: 21(2)(e),(h),(j); ISO 27001: A.5.14, A.8.5, A.8.24)
Ollama’s API has no authentication. The default ollama serve
accepts any caller on the bind interface and answers any prompt,
including /api/pull (download an arbitrary model) and
/api/delete (remove a local model). Authentication and TLS live
at the reverse proxy.
A minimal Caddy front-end on a separate hostname:
ai.internal.example.com {
	encode gzip

	# Block administrative endpoints entirely from outside
	@admin path /api/pull* /api/push* /api/delete* /api/create*
	handle @admin {
		respond 404
	}

	# Require basic auth on everything that is proxied, including
	# Ollama's OpenAI-compatible /v1/* routes
	basic_auth {
		ops $2a$14$REDACTED_BCRYPT_HASH
	}

	reverse_proxy 127.0.0.1:11434 {
		header_up Host {host}
		header_up X-Real-IP {remote_host}
	}
}
What that does:
- The pull, push, delete, create endpoints respond 404 to external callers even with valid credentials. Those operations stay on the host, via the ollama CLI run as the operator.
- The remaining endpoints (/api/generate, /api/chat, /api/embeddings, /api/tags) require HTTP basic auth over TLS. Caddy handles certificate provisioning automatically via Let’s Encrypt.
- Traffic to the model server itself stays plaintext on 127.0.0.1; the loopback hop is the trust boundary.
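When the matcher list changes, it helps to have the intended policy written down somewhere other than the Caddyfile. A minimal sketch, mirroring the four path prefixes in the @admin matcher above as shell glob patterns; the function name proxy_policy is hypothetical, and it models intent only, it does not query Caddy.

```shell
# Mirror of the @admin matcher: prints "blocked" for paths the proxy is meant
# to answer with 404, and "proxied" for everything else.
proxy_policy() {
    case "$1" in
        /api/pull*|/api/push*|/api/delete*|/api/create*) echo "blocked" ;;
        *) echo "proxied" ;;
    esac
}
```

Feeding it the list of paths your clients actually call is a quick way to spot an endpoint that a matcher change would cut off, before reloading the proxy.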
Generate the bcrypt hash with caddy hash-password. Rotate by
issuing a new hash and editing the basic_auth block; reload
Caddy. For multi-user production, replace basic auth with a
forward-auth handler against an identity provider or with a
LiteLLM proxy that does per-key accounting — covered in the
planned litellm-gateway-hardening piece, not here.
Rate limiting is plugin territory in Caddy; the simplest
production-acceptable answer is to put Ollama behind an Nginx
front-end with limit_req instead (see
nginx-ratelimit for the patterns).
4. GPU access scoping (ISO 27001: A.8.2, A.8.5)
The installer adds the ollama user to render and video,
which grants read/write access to /dev/nvidia0,
/dev/nvidiactl, and /dev/nvidia-uvm. Anyone else in those
groups inherits the same access.
Audit it:
getent group video render | awk -F: '{print $1": "$4}'
The output should list service accounts only. If a human user is
in video or render, they can talk to the GPU directly,
bypassing Ollama and any rate limits at the proxy. For a host
dedicated to model serving, remove human accounts from those
groups:
gpasswd -d alice video
gpasswd -d alice render
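The audit above can be turned into a check that fails when an unexpected account appears. This is a sketch that assumes the standard group database line format (name:x:gid:member,member); the function name unexpected_gpu_members and the allowlist you pass in are your own choices.

```shell
# Read `getent group video render` output on stdin and print "group member"
# for every member that is not in the allowlist given as arguments.
unexpected_gpu_members() {
    local allow=" $* "             # space-padded so whole names are matched
    awk -F: -v allow="$allow" '
        $4 != "" {
            n = split($4, members, ",")
            for (i = 1; i <= n; i++)
                if (index(allow, " " members[i] " ") == 0)
                    print $1, members[i]
        }
    '
}
```

Example: getent group video render | unexpected_gpu_members ollama. Note that the member field lists supplementary members only; an account whose primary group is video or render will not show up here and needs a separate look at the passwd database.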
For the systemd unit, the device allowances belong with the hardening drop-in:
# Append to /etc/systemd/system/ollama.service.d/override.conf
DevicePolicy=closed
DeviceAllow=/dev/nvidia0 rw
DeviceAllow=/dev/nvidiactl rw
DeviceAllow=/dev/nvidia-uvm rw
DeviceAllow=/dev/nvidia-uvm-tools rw
DevicePolicy=closed denies every device node except the standard
pseudo-devices (/dev/null, /dev/zero, /dev/random and similar); the
DeviceAllow lines open only what is needed. Adjust device
numbers if you have more than one GPU.
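For multi-GPU hosts the allowance list is mechanical enough to generate. A minimal sketch, assuming the GPUs appear as /dev/nvidia0 through /dev/nvidiaN-1 alongside the three shared nodes listed above; the function name gpu_device_allow is hypothetical, and the output should be checked against what ls /dev/nvidia* shows on the host.

```shell
# Print the device-control lines for a host with the given number of GPUs.
gpu_device_allow() {
    local count="$1" i=0 node
    echo "DevicePolicy=closed"
    while [ "$i" -lt "$count" ]; do
        echo "DeviceAllow=/dev/nvidia${i} rw"      # one node per GPU
        i=$((i + 1))
    done
    for node in nvidiactl nvidia-uvm nvidia-uvm-tools; do
        echo "DeviceAllow=/dev/${node} rw"         # nodes shared by all GPUs
    done
}
```

Running gpu_device_allow 2 prints the policy line followed by five DeviceAllow lines, ready to paste into the [Service] section of the drop-in.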
5. Model storage permissions and registry trust (GDPR: Article 32; ISO 27001: A.8.10, A.8.24)
Default location: /usr/share/ollama/.ollama/models. Owned by
ollama:ollama but the directory permissions are 755 — anyone
on the box can list the model store and read the blobs. Tighten:
chown -R ollama:ollama /usr/share/ollama
chmod 700 /usr/share/ollama
chmod 700 /usr/share/ollama/.ollama
find /usr/share/ollama -type d -exec chmod 700 {} +
find /usr/share/ollama -type f -exec chmod 600 {} +
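New blobs written after a pull inherit the service's umask, so the permissions can drift again. A sketch of a recurring check, assuming GNU find (the -perm /077 form matches any entry with a group or other permission bit set); the function name audit_store_perms is hypothetical.

```shell
# List every file or directory under the given root that still grants any
# group or other access. Returns non-zero if at least one is found.
audit_store_perms() {
    local root="$1" loose
    loose="$(find "$root" -perm /077 -print)"
    if [ -n "$loose" ]; then
        printf '%s\n' "$loose"
        return 1
    fi
}
```

Run it as root against the store, for example audit_store_perms /usr/share/ollama, and treat any output as a regression.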
For a host that handles confidential prompts or runs fine-tuned
models that represent operator IP, mount the model store on a
LUKS-encrypted partition. The patterns from
encrypted-backups-restic
apply — encryption at rest is cheap insurance against a stolen
disk.
Registry trust. Ollama pulls models from registry.ollama.ai
by default and verifies the SHA-256 of every blob against the
manifest before saving. That handles integrity-in-transit. It
does not verify the model contents are safe — a publisher
account that gets compromised can ship a poisoned model that
downloads and runs cleanly. Two mitigations:
- Pin models by digest (ollama pull llama3:8b@sha256:abc...) in any automation, so a re-pull after a registry incident does not silently update.
- Mirror trusted models to an internal registry (set OLLAMA_REGISTRY to your mirror’s URL). Same posture as pinning a pip or npm mirror.
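The same SHA-256 check can be repeated at rest, which catches a blob that was modified on disk after the pull. This is a sketch under one assumption that should be verified on your release: that blobs in the model store are files named sha256-<hex digest>. The function name verify_blob is hypothetical.

```shell
# Return zero if a file named sha256-<hex> actually hashes to <hex>.
verify_blob() {
    local path="$1" name want got
    name="$(basename "$path")"
    want="${name#sha256-}"                          # digest claimed by the name
    got="$(sha256sum "$path" | cut -d' ' -f1)"      # digest of the contents
    [ "$got" = "$want" ]
}
```

A loop over the blob directory that prints every path for which verify_blob fails makes a reasonable periodic integrity job. It says nothing about whether the model's contents are safe, only that they have not changed since they were stored.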
6. Logging and audit (NIS2: 21(2)(b); ISO 27001: A.8.15, A.8.16)
Ollama logs go to journald via the systemd unit:
journalctl -u ollama --since "1 hour ago"
Two log streams worth thinking about:
- Service lifecycle. Start, stop, errors, model loads. Forward to the central log target along with the rest of the host’s syslog stream, the same pipeline the log-minimisation-recipes guide describes.
- Request stream. Every /api/generate and /api/chat request includes the prompt. Ollama itself does not log prompts by default, but the reverse-proxy access log records every request (URL + method + status). Decide before you turn anything on whether prompts may legally be persisted; they often contain personal data or confidential business information. If they may not, configure the proxy to log without bodies and to drop the access log line on the /api/generate and /api/chat paths, or to hash the request body before logging.
The privacy framing here is the same as the
privacy-by-design-server-build
guide: decide retention and redaction policy before you ship,
not after.
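The hash-before-logging option mentioned above can be sketched in a few lines. This illustrates the idea only, not a proxy integration; the function name prompt_fingerprint is hypothetical.

```shell
# Print the SHA-256 hex digest of a request body, so a log line can correlate
# identical prompts without storing their text.
prompt_fingerprint() {
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
}
```

A plain hash lets anyone who can guess a prompt confirm that it was sent, so short or predictable prompts are not really hidden. If that matters for your data, a keyed hash (HMAC) with a secret held outside the log pipeline is the usual next step.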
7. Updates and maintenance (NIS2: 21(2)(e); ISO 27001: A.8.8)
The official update path is to re-run the install script:
curl -fsSL https://ollama.com/install.sh | sh
That is fine for a homelab. For production, pin the version and verify:
# Pin to a specific release
OLLAMA_VERSION=0.30.9
curl -fsSL -o /tmp/ollama \
  "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64"
curl -fsSL -o /tmp/ollama.sha256 \
  "https://github.com/ollama/ollama/releases/download/v${OLLAMA_VERSION}/ollama-linux-amd64.sha256"

# Verify the checksum before installing. The checksum file may name the
# release asset rather than /tmp/ollama, so compare its hash field directly.
echo "$(cut -d' ' -f1 /tmp/ollama.sha256)  /tmp/ollama" | sha256sum -c - \
  || { echo "checksum mismatch"; exit 1; }

# Replace the binary
systemctl stop ollama
install -o root -g root -m 0755 /tmp/ollama /usr/bin/ollama
systemctl start ollama
Wrap that in a small script that records the version and date in
the host’s change log. The same unattended-upgrades cadence
that handles OS packages does not handle Ollama — a manual
update cycle, scheduled, is the right answer.
Before updating: snapshot the host if you can, and run ollama list
and ollama show <model> to record the models you have, so a roll
forward (or back) is reproducible.
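The change-log part of that wrapper can be as small as the sketch below. The log path /var/log/ollama-changes.log and the function name record_ollama_change are hypothetical; use whatever change log the host already keeps.

```shell
# Append one line recording the installed version, the UTC timestamp and the
# operator who ran the update.
record_ollama_change() {
    local version="$1" log="${2:-/var/log/ollama-changes.log}"
    printf '%s ollama %s installed by %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$version" \
        "${SUDO_USER:-$(id -un)}" >> "$log"
}
```

Call it as the last step of the update script, after the service has restarted cleanly, for example record_ollama_change "$OLLAMA_VERSION".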
Verification — the smoke-test sequence
After every change to the unit, the proxy config, or an update, run the same checks:
# Service active and bound to the intended interface only
systemctl is-active ollama
ss -tlnp | grep 11434

# Hardening flags actually applied
systemctl show ollama -p \
  NoNewPrivileges,ProtectSystem,ProtectHome,PrivateTmp,MemoryMax

# Security score (before-and-after metric)
systemd-analyze security ollama | head -5

# Model store permissions tight
stat -c '%U:%G %a' /usr/share/ollama /usr/share/ollama/.ollama

# GPU device controls in effect (skip on CPU-only hosts)
systemctl show ollama -p DevicePolicy,DeviceAllow

# Reverse proxy returns 401 without credentials
curl -sI https://ai.internal.example.com/api/tags | head -1

# Reverse proxy blocks admin endpoints with 404
curl -sI -u ops:pass https://ai.internal.example.com/api/pull | head -1

# Authenticated read endpoint works
curl -s -u ops:pass https://ai.internal.example.com/api/tags \
  | jq '.models | length'
Three minutes end-to-end. They catch every regression I have hit on this stack so far.
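If the sequence is going to run unattended, the checks need a pass/fail result rather than output to read. A minimal sketch of a comparison helper; the names check_eq and FAILS are hypothetical.

```shell
# Count of failed checks, incremented by check_eq.
FAILS=0

# Compare an observed value with the expected one and report the result.
check_eq() {
    local desc="$1" want="$2" got="$3"
    if [ "$got" = "$want" ]; then
        echo "PASS $desc"
    else
        echo "FAIL $desc (want '$want', got '$got')"
        FAILS=$((FAILS + 1))
    fi
}
```

Each smoke test then becomes one line, such as check_eq "service active" active "$(systemctl is-active ollama)" or check_eq "proxy demands auth" 401 "$(curl -s -o /dev/null -w '%{http_code}' https://ai.internal.example.com/api/tags)", and the script can finish with a test on FAILS to set its exit status.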
What this guide deliberately does not cover
- LiteLLM, OpenAI-compatible gateways, per-key quotas. That is the planned litellm-gateway-hardening piece: Ollama is the model serving layer, LiteLLM is the policy and accounting layer in front of it. They harden separately.
- Prompt-injection defences and RAG-layer data security. Covered by agentic-ai-pipeline-reference.
- Vector databases. Planned qdrant-hardening and pgvector-on-postgres pieces.
- Multi-node serving / load balancing. Ollama is single-node by design; the planned vllm-production-baseline piece covers the multi-node case with vLLM, where the operational shape is meaningfully different.
- AMD ROCm and Intel Gaudi GPU stacks. NVIDIA-only here. The systemd device allowances follow the same pattern but the device nodes differ.
- Fine-tuning workflows and model hosting. Out of scope.