Log Minimisation Recipes — Nginx, Apache, PostgreSQL, Applications

Tested on: Ubuntu 24.04 LTS, Nginx 1.26.x, Apache 2.4.x, PostgreSQL 16.x, Python 3.12. The principles are language- and stack-independent; the recipes are concrete examples.

Why this matters

Default logging on every component in a typical stack captures more personal data than the operator usually realises. A standard Nginx access log line:

192.0.2.42 - - [17/May/2026:14:01:12 +0000] "GET /search?q=password+reset+for+alice%40example.com HTTP/2.0" 200 4231 "https://example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15..."

…retains, for as long as the log is kept:

The visitor’s full IPv4 address (personal data under GDPR whenever the visitor is reasonably identifiable from it; see CJEU C-582/14, Breyer).
The exact request — including any query-string content, which can contain email addresses, tokens, or other identifying data.
A User-Agent string detailed enough to fingerprint the browser, OS patch level, and often the specific user.
The referrer URL, which may itself contain personal data.

Multiply by every request, every component (web server, application server, database, auth daemon), and the implicit “we collect minimal data” claim in your privacy notice quietly stops being true.

This guide is the recipe collection for fixing that. For the decision level — what logs to keep at all, for how long — see privacy-by-design-server-build. This guide assumes those decisions are made and turns them into configuration.

1. Decide before you configure

Three questions for every log stream, before you change anything:

What incident or class of bug would this log help diagnose? If the answer is “I’m not sure”, default to not logging it.
What identifying data does the default format include? Audit first; the surprise almost always runs in one direction: more PII than expected.
Could a hashed or masked identifier serve the same investigation? Almost always yes for analytics-style questions, sometimes for debugging, rarely for security incident response.

The recipes below assume you have answered these and reached the common position: keep operational signal, drop the identifying details.
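The audit in the second question can start as a quick script run over a sample of existing log lines. A minimal sketch, assuming the Nginx combined format shown earlier; the patterns and the audit_line helper are illustrative, not exhaustive detectors:

```python
import re
from urllib.parse import unquote

# Illustrative patterns only; a real audit needs broader detectors.
PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "query_string": re.compile(r'"[A-Z]+ \S*\?\S+ HTTP/'),
    "user_agent": re.compile(r'"Mozilla/[^"]*"'),
}

def audit_line(line: str) -> set[str]:
    """Return the names of the identifying fields found in one log line."""
    decoded = unquote(line)  # so that %40 is seen as @
    return {name for name, rx in PATTERNS.items() if rx.search(decoded)}

sample = (
    '192.0.2.42 - - [17/May/2026:14:01:12 +0000] '
    '"GET /search?q=password+reset+for+alice%40example.com HTTP/2.0" '
    '200 4231 "https://example.com/" "Mozilla/5.0 (Macintosh)"'
)
print(sorted(audit_line(sample)))
```

Running the same function over lines written in a minimised format shows which identifying fields are still present.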

2. Nginx — privacy-preserving access log

The default combined log_format captures everything described above. Replace it with a deliberate format in your http {} block:

# Mask the last IPv4 octet (192.0.2.42 → 192.0.2.0) and the last 64
# bits of an IPv6 address (only the /64 prefix is kept). Combined with
# daily log rotation, this retains aggregate analytics utility while
# removing fine-grained individual identifiability.
map $remote_addr $ip_masked {
    "~^(?<a>\d+\.\d+\.\d+)\.\d+$"                       "$a.0";
    # Matches only when the first four groups are written out in full.
    # Addresses compressed with "::" inside the prefix fall through to
    # the default, which over-masks rather than under-masks.
    "~^(?<a>[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{1,4}){3}):"   "$a::";
    default                                              "0.0.0.0";
}

log_format minimal '$ip_masked - [$time_iso8601] '
                   '"$request_method $uri $server_protocol" '
                   '$status $body_bytes_sent $request_time';

# Apply per server block, or globally:
access_log /var/log/nginx/access.log minimal;

# Error log retains full IP: necessary for diagnosis of attack patterns.
# Trade-off: error log is far smaller; its retention should be shorter.
error_log /var/log/nginx/error.log warn;

Three things deliberately removed from this format compared to combined:

$query_string / full URI with query. $uri excludes the query string. If you need the query string for specific debugging, log it on those endpoints only, inside the relevant location block, with if ($arg_debug = "1") { access_log ... ; }.
$http_user_agent. Useful for browser compatibility analytics — but those questions are better answered by analytics tools (like Plausible) than by post-hoc log analysis. The User-Agent is one of the strongest passive fingerprints.
$http_referer. Referrer URLs can contain personal data that the referring site embedded; retaining them is taking on someone else’s data minimisation problem.

If you genuinely need User-Agent for diagnosing a specific issue, add a dated temporary log_format for that case and remove it when done. Permanent collection is the trap to avoid.
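To check what masked values should look like, or to apply the same rule in an application that logs client addresses itself, the masking can be expressed with Python's standard ipaddress module. This is a sketch of the rule described in the map block's comment (/24 for IPv4, /64 for IPv6); mask_ip is a hypothetical helper name:

```python
import ipaddress

def mask_ip(addr: str) -> str:
    """Zero the host part: keep /24 for IPv4 and /64 for IPv6."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return "0.0.0.0"  # same fallback as the map's default entry
    prefix = 24 if ip.version == 4 else 64
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

print(mask_ip("192.0.2.42"))
print(mask_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))
```

Because the module works on the numeric address rather than its text form, it also handles addresses that are written in compressed notation.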

3. Apache — equivalent recipe

Same idea, different syntax:

# In /etc/apache2/apache2.conf or a conf-enabled file:

# Log the anonymised address from an environment variable instead of %h.
LogFormat "%{X-Anonymised-IP}e - [%{%Y-%m-%dT%H:%M:%S%z}t] \"%m %U %H\" %>s %B %D" minimal

# Anonymise at request time via SetEnvIf: zero the last IPv4 octet and
# keep only the first four groups (/64) of an IPv6 address. Addresses
# that match neither pattern are logged as "-".
SetEnvIfNoCase Remote_Addr "^(\d+\.\d+\.\d+)\.\d+$" X-Anonymised-IP=$1.0
SetEnvIfNoCase Remote_Addr "^([0-9a-fA-F]{1,4}(?::[0-9a-fA-F]{1,4}){3}):" X-Anonymised-IP=$1::

CustomLog ${APACHE_LOG_DIR}/access.log minimal
ErrorLog  ${APACHE_LOG_DIR}/error.log
LogLevel warn

%D is request duration in microseconds, useful for performance debugging without identifying anyone. %B is the response size in bytes excluding headers; unlike %b, it logs 0 rather than - when no body was sent, which keeps the column numeric for parsers.
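As an illustration of how much operational signal survives, the minimal format is enough to compute per-path latency. A sketch in Python, assuming lines in exactly the minimal LogFormat shown above; the regular expression and the mean_latency_ms helper are hypothetical:

```python
import re
from collections import defaultdict

LINE = re.compile(
    r'^(?P<ip>\S+) - \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) (?P<micros>\d+)$'
)

def mean_latency_ms(lines):
    """Average request duration per path in milliseconds (%D is microseconds)."""
    totals = defaultdict(lambda: [0, 0])  # path -> [sum of microseconds, count]
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines written in any other format
        entry = totals[m["path"]]
        entry[0] += int(m["micros"])
        entry[1] += 1
    return {path: total / count / 1000 for path, (total, count) in totals.items()}

sample = [
    '192.0.2.0 - [2026-05-17T14:01:12+0000] "GET /search HTTP/1.1" 200 4231 15000',
    '192.0.2.0 - [2026-05-17T14:01:13+0000] "GET /search HTTP/1.1" 200 4100 25000',
    '198.51.100.0 - [2026-05-17T14:01:14+0000] "GET / HTTP/1.1" 200 512 3000',
]
print(mean_latency_ms(sample))
```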

4. PostgreSQL — what to log and what not to

Postgres logs are a recurring privacy surprise. The shipped default is log_statement = none, but a permissive setting such as log_statement = all writes every query to disk, including SQL literals and bind-parameter values that contain personal data.

The recommended baseline (also covered in the postgresql-hardening guide) for postgresql.conf:

log_connections = on            # Auth-event signal — important for security.
log_disconnections = on
log_hostname = off              # Avoid reverse DNS — adds latency and PII.
log_line_prefix = '%m [%p] %q%u@%d '

# DDL only by default. Switch to 'mod' or 'all' deliberately and only
# on diagnostic windows — never as a permanent setting.
log_statement = 'ddl'

# Slow-query log lets you find performance problems without recording
# every successful query.
log_min_duration_statement = 500     # ms

# Statements that error get logged regardless — that's where the
# debugging signal usually lives anyway.
log_min_error_statement = error

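# Keep bind-parameter values out of the log. log_parameter_max_length
# defaults to -1 (values are logged in full alongside any logged
# statement, including slow-query entries), so set both limits to 0
# explicitly. Parameter values are precisely the personal data you are
# trying not to retain.
log_parameter_max_length = 0
log_parameter_max_length_on_error = 0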

For an active investigation you can temporarily raise the verbosity on a single role:

ALTER ROLE diagnostic_user SET log_statement = 'all';
-- ... investigate ...
ALTER ROLE diagnostic_user RESET log_statement;

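…and remember to revert. The per-role setting takes effect for sessions that role opens after the change, and changing log_statement requires superuser (or explicitly granted) privileges. The role-scoped variant ensures other clients are not pulled into the diagnostic window.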
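A baseline like this tends to drift, so it can be worth checking a postgresql.conf against it from CI or a cron job. A minimal sketch in Python, assuming a plain key = value file without include directives; parse_conf, risky_settings and the rule set are illustrative and cover only the settings discussed in this section:

```python
import re

SETTING = re.compile(r"^\s*([a-z_.]+)\s*=?\s*('(?:[^']|'')*'|[^#\s]+)")

def parse_conf(text: str) -> dict[str, str]:
    """Parse key = value lines, ignoring comments; later lines win."""
    settings = {}
    for line in text.splitlines():
        m = SETTING.match(line)
        if m:
            settings[m.group(1)] = m.group(2).strip("'").lower()
    return settings

def risky_settings(text: str) -> list[str]:
    """Return warnings for settings that write personal data to the log."""
    conf = parse_conf(text)
    warnings = []
    if conf.get("log_statement", "none") in ("mod", "all"):
        warnings.append("log_statement records data-modifying or all statements")
    # PostgreSQL's default for log_parameter_max_length is -1 (log in full).
    if conf.get("log_parameter_max_length", "-1") != "0":
        warnings.append("bind parameters are logged with statements")
    if conf.get("log_parameter_max_length_on_error", "0") != "0":
        warnings.append("bind parameters are logged with errors")
    if conf.get("log_hostname", "off") in ("on", "true", "1", "yes"):
        warnings.append("log_hostname performs reverse DNS lookups")
    return warnings

print(risky_settings("log_statement = 'all'\n"))
```

The check only reads the file; settings applied per role or per database with ALTER ROLE or ALTER DATABASE would need a query against the running server instead.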

5. Application frameworks

The base reflex at every framework level is the same: log identifiers, not identities. Specific patterns:

Python (Flask, FastAPI, Django)

import structlog
from flask import Flask, request

# Flask-style example; User and issue_token come from your own application code.
app = Flask(__name__)
logger = structlog.get_logger()

@app.post("/login")
def login():
    data = request.get_json()
    user = User.query.filter_by(email=data["email"]).first()
    if not user or not user.check_password(data["password"]):
        # Log the attempt without the email: the event still feeds
        # brute-force detection, but the log doesn't itself become a
        # list of email addresses.
        logger.warning("login.failed", reason="bad_credentials")
        return {"error": "invalid"}, 401

    logger.info("login.succeeded", user_id=user.id)
    return {"token": issue_token(user)}

Structured logging (structlog, python-json-logger) makes minimisation explicit: you list the keys you intend to log; nothing else gets in by accident. With f-strings into the message, it is easy to forget what’s in the format string.
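One way to enforce that list mechanically is an allowlist step in the logging pipeline. A minimal sketch, written as a function with structlog's processor signature (logger, method name, event dict) so that it could be added to the processors list passed to structlog.configure; the ALLOWED_KEYS set and the function name are assumptions for illustration:

```python
ALLOWED_KEYS = {"event", "level", "timestamp", "reason", "user_id"}

def drop_unlisted_keys(logger, method_name, event_dict):
    """Keep only allowlisted keys; record the names (not values) of dropped ones."""
    dropped = sorted(k for k in event_dict if k not in ALLOWED_KEYS)
    cleaned = {k: v for k, v in event_dict.items() if k in ALLOWED_KEYS}
    if dropped:
        cleaned["dropped_keys"] = dropped
    return cleaned

# Called directly here to show the effect; structlog would call it for each event.
event = {"event": "login.failed", "reason": "bad_credentials", "email": "alice@example.com"}
print(drop_unlisted_keys(None, "warning", event))
```

Recording the dropped key names makes an accidental attempt to log personal data visible in review without the value ever reaching disk.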

Node (Express, Fastify)

pino is a popular choice and emits structured JSON by default. The pattern is identical: log identifiers, never user objects in their entirety.

import express from 'express';
import pino from 'pino';

// users and issueToken come from your own application code.
const app = express();
app.use(express.json());
const logger = pino({ redact: ['*.password', '*.token', 'req.headers.cookie'] });

app.post('/login', async (req, res) => {
    const user = await users.findByEmail(req.body.email);
    if (!user || !(await user.checkPassword(req.body.password))) {
        logger.warn({ event: 'login.failed' });
        return res.status(401).json({ error: 'invalid' });
    }
    logger.info({ event: 'login.succeeded', userId: user.id });
    return res.json({ token: issueToken(user) });
});

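pino’s redact option is the safety net for the inevitable case where someone logs the full request object: fields matching the listed paths are replaced with [Redacted] automatically. Only the listed paths are covered, so extend the list as new sensitive fields appear.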

Generic principles (any language)

Never put user-provided strings directly into a log message format. Use structured fields. Format-string logs make later redaction painful.
Hash before logging for analytics. A daily-rotating salt + hash of an identifier gives you “unique users per day” without retaining identity overnight.
Authentication events are different. Failed-auth logs need some signal — IP and timestamp at minimum — to support brute-force detection. The right answer is usually: log the IP (or a hash of it), log the timestamp, log the username being attempted only on successful auth. Don’t log the password attempt itself, ever.
Tracing systems retain everything. OpenTelemetry / Jaeger spans often capture HTTP headers, query strings, and parameters by default. Apply the same minimisation discipline to your tracing config.

6. systemd journal and auth logs

The systemd journal aggregates logs from every unit on the host. It typically contains:

Every sudo invocation, with the command line.
Every SSH connection attempt, successful and failed.
Service startup / shutdown events.

Most of this is security-relevant — you want to keep it. The configuration concern is retention, not content:

# /etc/systemd/journald.conf
[Journal]
Storage=persistent
SystemMaxUse=500M
SystemMaxFileSize=50M
MaxFileSec=1week
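# Retention ceiling for the whole journal, auth events included.
# (Comments must be on their own line; systemd does not parse trailing ones.)
MaxRetentionSec=90day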

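MaxRetentionSec is the lever, and it applies to the whole journal, not only to auth events. Ninety days is a defensible default for auth events; longer if your incident-response window is longer, shorter only if you have a hot-backup of these events elsewhere. The size limits still apply: once SystemMaxUse is reached, older entries are removed even if they are younger than 90 days.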

Forward auth.log and journalctl output to a separate host. Local audit logs on a compromised host are evidence the attacker controls.

7. Putting it together — rotation and retention

Configuration only delivers value when paired with the right rotation and retention. Common-case logrotate for Nginx:

# /etc/logrotate.d/nginx (Ubuntu default — review the retention number)
/var/log/nginx/*.log {
    daily
    # 30 days; align with privacy notice. (Comment kept on its own line,
    # which is the safe form for logrotate.)
    rotate 30
    missingok
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    postrotate
        if invoke-rc.d nginx status > /dev/null 2>&1; then
            invoke-rc.d nginx rotate >/dev/null 2>&1
        fi
    endscript
}

The retention number must match what your privacy notice says. If the notice promises “we retain access logs for 30 days”, rotate 90 contradicts that — and backups of the log volume effectively extend the retention even further. Cross-check.
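That cross-check is simple arithmetic and can be automated. A sketch in Python, assuming a single logrotate stanza that uses one of the daily, weekly or monthly keywords together with rotate N; on_disk_days, effective_retention_days and their parameters are hypothetical, and the result is an approximation (it ignores maxage, size-based rotation and the still-open current file):

```python
import re

PERIOD_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}

def on_disk_days(stanza: str) -> int:
    """Approximate days of rotated logs kept: rotation period times rotate count."""
    period = next(
        (days for word, days in PERIOD_DAYS.items()
         if re.search(rf"^\s*{word}\b", stanza, re.MULTILINE)),
        None,
    )
    count = re.search(r"^\s*rotate\s+(\d+)", stanza, re.MULTILINE)
    if period is None or count is None:
        raise ValueError("stanza needs a rotation period and a rotate count")
    return period * int(count.group(1))

def effective_retention_days(stanza: str, backup_days: int = 0) -> int:
    """Logs live as long as the longest-lived copy, on disk or in backups."""
    return max(on_disk_days(stanza), backup_days)

stanza = """
/var/log/nginx/*.log {
    daily
    rotate 30
    compress
}
"""
notice_days = 30
print(effective_retention_days(stanza) <= notice_days)
print(effective_retention_days(stanza, backup_days=90) <= notice_days)
```

The second call shows the backup effect: the on-disk setting matches a 30-day notice, but 90-day backups of the same volume do not.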

Gotchas

tail -f and journalctl reveal what minimisation hid

Dropping a field in log_format keeps it out of the file and out of tail -f alike, because minimisation happens at write time. The reverse also holds: whatever the writer does emit is visible to any operator running tail -f or journalctl -f on the live stream, however short the retention is. If you log identifying data at all (even for short-term diagnostic windows), assume operators see it.

Forwarding to SaaS providers re-introduces what you removed

Sending logs to a centralised SaaS (Datadog, Splunk, ELK-as-a-service) puts the minimisation question at their end too. If your forwarded logs contain anything beyond what your privacy notice describes, the forward is itself a controller decision.

Backups extend log retention

A 30-day on-disk retention plus 90-day backups means the effective retention period is 90 days. Either shorten the backup retention or update the privacy notice to match.

Sampling instead of minimising

A common reflex: “we log 100% of requests but with privacy in mind”. The harder question is whether you need 100% of requests at all. For a content site, sampled access logs (one in a hundred, retained fully) can give you debugging signal with two orders of magnitude less personal data on disk.
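Inside an application, one-in-N sampling can be a logging filter. A minimal sketch using the standard library's logging.Filter; the SampleFilter class and the 1-in-100 rate are illustrative, and counting records (rather than sampling at random) is an assumption made so that the behaviour is easy to verify:

```python
import itertools
import logging

class SampleFilter(logging.Filter):
    """Let through every Nth record at or below INFO; never drop warnings or errors."""

    def __init__(self, every: int = 100):
        super().__init__()
        self.every = every
        self._counter = itertools.count()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.INFO:
            return True  # keep the records needed for incident diagnosis
        return next(self._counter) % self.every == 0

access_logger = logging.getLogger("access")
access_logger.addFilter(SampleFilter(every=100))
```

Warnings and errors bypass the sampler on purpose: they are rare and carry most of the debugging value, while routine successful requests are the bulk of the personal data.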

What this guide deliberately does not cover

Centralised log aggregation (Loki, ELK, Datadog) — same minimisation principles, different configuration; a separate guide.
Application-specific debugging frameworks (Sentry, Honeybadger) — they have their own scrubbing rules; configure those rules with the same discipline.
Audit logging for compliance evidence — separate purpose, separate retention; do not collapse “audit log” and “operational log” into one stream.
Email server logs (Postfix, Dovecot) — covered separately in the planned email-server-hardening guide.
