Tested on: Hugo 0.147.0 (extended), Cloudflare Pages (Free plan), 23 May 2026. The standards in this space are moving fast — re-check the IANA Link Relations registry and the IETF Content-Signal draft before copying anything verbatim into a production site.
The phrase “agent-ready” covers a moving target. Twelve months ago
it meant “have a sitemap.” Today it touches RFC 8288 Link headers,
the IETF Content-Signal draft, IANA registered relations, content
negotiation for text/markdown, llms.txt, AI sitemaps, and
/.well-known/agent-skills/. Most of those are drafts or vendor
experiments. Some of them are stable enough to ship now. Others are
worth watching but not worth wiring up yet.
This is what I shipped for stackharden.com (a Hugo site on
Cloudflare Pages), what I deliberately did not ship, and the
reasoning for each call. Where a recommendation is generic and does
not fit a content site, I have ignored it rather than padded the
implementation with rels pointing at endpoints that don’t exist.
Terms used in this guide
A few abbreviations come up repeatedly. Defined here so the rest of the guide can stay terse.
- RFC: Request for Comments. The numbered specification documents that define Internet standards (e.g. RFC 8288 defines the HTTP Link header). Published by the IETF.
- IETF: Internet Engineering Task Force. The standards body that develops and publishes RFCs covering most of the protocols the web runs on.
- IANA: Internet Assigned Numbers Authority. The registry that maintains, among other things, the official list of HTTP link relation types: the values you can put in a rel="..." parameter and expect a compliant parser to recognise.
- LLM: Large Language Model. The class of AI system (ChatGPT, Claude, Gemini, etc.) that may consume the site programmatically, either at training time (in bulk) or at query time (per request).
- RAG: Retrieval-Augmented Generation. The pattern where an LLM, instead of answering only from its training, fetches relevant external content at query time and incorporates it into the answer. RAG-style consumption is what the ai-input content signal speaks to.
- rel: short for relation type. The rel parameter on a Link header (or HTML <link> element) declares what the linked resource is to the current one (e.g. rel="sitemap", rel="author").
- RSS: Really Simple Syndication. The XML feed format Hugo publishes at /index.xml, used by readers and aggregators that poll for new content.
Why this matters, briefly
LLMs are increasingly intermediating discovery. A reader looking for
“how do I configure NIS2-compliant logging on PostgreSQL” may never
reach Google — they ask Claude, ChatGPT, Gemini, or whatever else
their organisation has standardised on. If those tools can identify,
ingest, and correctly attribute your content, you are in the answer.
If they cannot, you are not. The signals that determine which side of
the line you land on are increasingly explicit, and they live in
HTTP headers and robots.txt rather than in your HTML.
That is the upside. The downside is that “agent-ready” checklists proliferate faster than the standards under them stabilise. Spending an afternoon implementing every recommendation a tester throws at you is a sure way to ship code you’ll be ripping out in a quarter. The question is not “can I score 100%?” — it is “which of these signals are durable enough to be worth maintaining?”
RFC 8288 Link headers
What it is. RFC 8288 defines the Link HTTP response header.
Each Link value carries a URL and one or more rel parameters that
declare the relationship between the response and the linked
resource. Browsers ignore most of them. Agents and crawlers
increasingly use them as a low-cost discovery surface — strictly
cheaper than parsing HTML to look for <link> tags in the document
head.
What to ship. Cloudflare Pages reads a _headers file at the
root of the static output directory (Hugo copies static/_headers
to the build output verbatim). The format is one URL pattern per
block, with indented header lines underneath. Multiple Link lines
on a single pattern get merged into one comma-separated header on
the wire — that’s valid per RFC 8288 §3 and parsers handle it
identically.
The set I shipped:
/*
  Link: </sitemap.xml>; rel="sitemap"
  Link: </index.xml>; rel="alternate"; type="application/rss+xml"; title="StackHarden RSS feed"
  Link: </about/>; rel="author"
  Link: </privacy/>; rel="privacy-policy"
  Link: </disclaimer/>; rel="license"
  Link: </contribute/>; rel="help"
Six rels — all of them registered with IANA, all of them pointing at
URLs that exist on the live site. Critically, no service-doc and no
api-catalog. RFC 8631 defines service-doc and RFC 9727 §3 defines
api-catalog, for sites that publish API documentation or an API
catalog respectively. This site does neither. Adding them
and pointing at /docs or similar would be padding for the checker
and a lie to anyone actually following the link. If a recommendation
tells you to include a rel and the target does not exist on your
site, ignore the recommendation.
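The rule about never pointing a rel at a missing target can be checked before deploy. The following is a minimal sketch, not part of the original setup: it assumes Hugo's default public/ output directory and pretty URLs (so /about/ maps to public/about/index.html), and the function name missing_link_targets is hypothetical.

```python
import re
from pathlib import Path

# Captures the <target> part of a "Link: <target>; rel=..." line.
LINK_LINE = re.compile(r'^\s*Link:\s*<([^>]+)>', re.IGNORECASE)


def missing_link_targets(headers_text: str, output_dir: Path) -> list[str]:
    """Return Link targets from a _headers file that have no file in output_dir."""
    missing = []
    for line in headers_text.splitlines():
        match = LINK_LINE.match(line)
        if not match:
            continue
        target = match.group(1)
        if not target.startswith("/"):
            continue  # absolute URLs point off-site; this sketch skips them
        relative = target.lstrip("/")
        # Assumption: Hugo pretty URLs, so a trailing slash means <dir>/index.html.
        if target.endswith("/"):
            candidate = output_dir / relative / "index.html"
        else:
            candidate = output_dir / relative
        if not candidate.is_file():
            missing.append(target)
    return missing
```

After a hugo build, calling missing_link_targets(Path("static/_headers").read_text(), Path("public")) should return an empty list; anything it does return is a rel that points nowhere.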
How to verify on the wire.
curl -sI https://stackharden.com/ | grep -i '^link:'
You should see one Link: line with all six entries comma-separated,
or (depending on the upstream) six separate Link: lines. Either is
correct. To break it out cleanly for inspection:
curl -sI https://stackharden.com/ | grep -o 'rel="[^"]*"' | sort -u
That returns one line per distinct rel. Six unique values means the headers are intact.
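To see why the merged and the separate forms are equivalent, it helps to parse them the way a consumer would. This is a simplified sketch rather than a full RFC 8288 parser: it ignores backslash-escaped quotes, and the helper names (split_link_values, rels_seen) are made up for illustration. The sample header strings are shortened versions of the set above.

```python
import re


def split_link_values(header_value: str) -> list[str]:
    """Split a merged Link header on commas that sit outside <...> and "..."."""
    parts, current = [], []
    in_target = in_quotes = False
    for char in header_value:
        if char == '"' and not in_target:
            in_quotes = not in_quotes
        elif char == "<" and not in_quotes:
            in_target = True
        elif char == ">" and not in_quotes:
            in_target = False
        if char == "," and not in_target and not in_quotes:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def rels_seen(link_headers: list[str]) -> set[str]:
    """Collect every rel value across one or more Link header lines."""
    rels = set()
    for header in link_headers:
        for value in split_link_values(header):
            match = re.search(r';\s*rel="?([^";]+)"?', value, re.IGNORECASE)
            if match:
                # One rel parameter may carry several space-separated types.
                rels.update(match.group(1).split())
    return rels


EXPECTED = {"sitemap", "alternate", "author", "privacy-policy", "license", "help"}

merged = ['</sitemap.xml>; rel="sitemap", </about/>; rel="author"']
separate = ['</sitemap.xml>; rel="sitemap"', '</about/>; rel="author"']
print(rels_seen(merged) == rels_seen(separate))  # same set either way
print(sorted(EXPECTED - rels_seen(merged)))      # rels this shortened sample lacks
```

Feeding it the real Link lines from the curl output and comparing against EXPECTED gives a stricter check than counting unique rel strings by eye.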
Gotchas.
- _headers is evaluated at the Cloudflare edge, not by Hugo. The Hugo dev server (hugo server) does not apply it, so verification has to happen against the live deploy.
- Malformed entries are logged in the Cloudflare Pages build output and silently skipped. The site does not break; the rules just do not apply. Check the deploy log if a curl does not surface what you expected.
- The /* pattern matches every path including assets. That is intentional here: there is no downside to a .css response carrying these Link headers, and the alternative (only /) means an agent landing on a guide does not see them.
Content-Signal in robots.txt
What it is. A directive in robots.txt declaring the site’s
position on three discrete consumption modes:
- search: search engines building indexes and returning hyperlinks / short excerpts.
- ai-input: AI systems using the content as live input for retrieval-augmented generation, real-time grounding, AI search summaries.
- ai-train: using the content to train or fine-tune AI models.
The draft is still in IETF flow, but the major search and AI
operators have publicly stated they will honour it. Cloudflare
injects a managed version of this directive automatically when “AI
Crawl Control” is enabled in the dashboard. The injection is fine
but it is opaque — it sits inside # BEGIN Cloudflare Managed content markers and you don’t control it.
I chose to own the declaration in the site’s own robots.txt
template. That makes the signal stable regardless of Cloudflare
edge-level behaviour, regardless of plan tier, and regardless of
whoever later touches the AI Crawl Control settings.
What to ship. Override the theme’s robots.txt template by
creating layouts/robots.txt in the Hugo project root (no extension
gymnastics; Hugo looks for it there). Hugo only renders the template
when enableRobotsTXT is set to true in the site config. PaperMod’s
upstream template becomes:
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
{{- if hugo.IsProduction | or (eq site.Params.env "production") }}
Disallow:
{{- else }}
Disallow: /
{{- end }}
Sitemap: {{ "sitemap.xml" | absURL }}
The position I took:
- search=yes: we want to be found. Obvious.
- ai-input=yes: when someone asks an AI assistant about a topic we cover well, we want our content cited in the answer. The cost is loss of direct traffic for those queries; the benefit is wider exposure and citation as the authoritative source. For a content site whose differentiator is the writing, this trades short-term pageviews for long-term authority. The math swings the other way for a site monetised purely on ads or one offering paywalled primary research.
- ai-train=no: we do not consent to wholesale ingestion for model training. Distinct from ai-input: training is bulk ingestion typically without per-request attribution; ai-input is per-query lookup typically with a citable source.
The three signals are independent. Allowing search but disallowing ai-input is a coherent stance for a paywalled site. Allowing ai-input but disallowing ai-train is the standard stance for a content site that wants discovery but not bulk consumption.
How to verify.
curl -s https://stackharden.com/robots.txt
Content-Signal: should appear on its own line inside the default
User-agent: * block, before the Disallow: directive. If
Cloudflare’s AI Crawl Control is also enabled, you may see a second,
Cloudflare-injected block with # BEGIN Cloudflare Managed content
markers carrying its own signals and a list of per-bot Disallow
directives. That layering is fine — our explicit ai-train=no is
the higher-level statement; Cloudflare’s per-bot list is
belt-and-braces for named crawlers.
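When two blocks can each carry a Content-Signal line, it is worth checking that they do not contradict each other. The sketch below assumes the comma-separated key=value syntax used in the template above (the draft may still change it) and uses a hypothetical helper name, content_signals. The sample text mirrors what the template renders in production.

```python
def content_signals(robots_txt: str) -> list[dict[str, str]]:
    """Return one dict per Content-Signal line, in file order."""
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        signals = {}
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = setting.strip().lower()
        found.append(signals)
    return found


sample = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow:
Sitemap: https://stackharden.com/sitemap.xml
"""
declared = content_signals(sample)
print(declared)

# Flag any later block that disagrees with the first declaration.
conflicts = [
    later for later in declared[1:]
    if any(later.get(key, value) != value for key, value in declared[0].items())
]
print(conflicts)
```

An empty conflicts list means any injected block agrees with the site's own declaration on every signal it repeats.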
Gotchas.
- The IETF draft is not yet an RFC. The exact directive name and syntax may shift before stabilisation. Watch the draft and be prepared to rename Content-Signal if the IETF flow lands on a different keyword.
- A signal like ai-input=no is a declaration, not enforcement. An LLM operator that ignores the directive can still scrape and serve. The signal is consent and audit evidence (useful if a dispute ends up legal), not a hard block. For hard blocks, look at IP-level blocking, Cloudflare’s bot-fight settings, or Disallow: / blocks for named user agents.
Markdown for Agents
What it is. Content negotiation. When a client sends Accept: text/markdown on a request, the server responds with a Markdown
representation of the same content rather than the HTML. Browsers
keep getting HTML; agents that want lower-token-cost ingestion get
Markdown.
Cloudflare offers this as a managed feature: zone → AI Crawl Control → Markdown for Agents toggle. The conversion happens at the edge — no origin changes required.
Why I did not ship it.
Cloudflare gates the feature to Pro, Business, and Enterprise plans (~€20/month minimum for Pro). On a Free-plan zone the toggle simply is not present. For a small content site that does not yet have a revenue model, that is a real cost decision, not a trivial upgrade.
The DIY alternative is a Hugo output format (Hugo can emit a .md
variant of every page at build time) combined with a Cloudflare
Pages Function (functions/_middleware.js) that intercepts
requests, checks the Accept header, and rewrites to the .md
variant when the client opts in. It works. It is also two surfaces
of code that will need maintenance as shortcodes evolve — the
markdown output for a page that uses {{< compliance >}} either
contains raw Hugo syntax (which is wrong) or requires a parallel
markdown renderer for each shortcode (which is real engineering).
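The decision the middleware has to make is small, even if maintaining the Markdown output is not. A real Pages Function would be JavaScript; the Python below only illustrates the negotiation logic under stated assumptions: the function names are hypothetical, the /guides/foo/ path in the comment is a made-up example, and the index.md mapping assumes a Hugo Markdown output format that writes index.md next to each index.html.

```python
def wants_markdown(accept_header: str) -> bool:
    """True when text/markdown is listed and is at least as preferred as text/html."""
    weights = {}
    for item in accept_header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        quality = 1.0  # default weight when no q parameter is given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        weights[media_type.lower()] = quality
    markdown_q = weights.get("text/markdown", 0.0)
    # Wildcards such as */* do not count as opting in: browsers send them too.
    return markdown_q > 0 and markdown_q >= weights.get("text/html", 0.0)


def markdown_path(request_path: str) -> str:
    """Map /guides/foo/ to /guides/foo/index.md (assumed Hugo output layout)."""
    if request_path.endswith("/"):
        return request_path + "index.md"
    return request_path
```

Requiring an explicit text/markdown entry is a deliberate choice in this sketch: it keeps every browser on HTML, since browsers only ever match Markdown through a wildcard.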
For a content site whose audience is human practitioners reading
HTML in a browser, the value of content-negotiated markdown is
narrow. An LLM crawling the site for retrieval will parse the HTML
fine — <pre><code> blocks, headings, lists, and the Link headers
we just shipped are enough. The marginal benefit of serving
markdown is reduced token cost at ingestion time, which the
crawler bears, not us.
The honest answer: skipped, because the value-to-effort ratio is low. If you are on Pro+ already, flip the toggle and verify with:
curl -sI -H 'Accept: text/markdown' https://example.com/some/page/ \
  | grep -i 'content-type\|x-markdown'
A Content-Type: text/markdown confirms it landed. If you are on
Free, this is the most defensible omission on the list.
Verification commands, all in one place
For the working set above, the smoke-test sequence is:
# Link headers: six rels expected
curl -sI https://your-site/ | grep -o 'rel="[^"]*"' | sort -u

# Content-Signal: single directive in default user-agent block
curl -s https://your-site/robots.txt | grep -i '^content-signal:'

# Sitemap: referenced from robots.txt and reachable
curl -sI https://your-site/sitemap.xml | head -1

# RSS feed: alternate representation
curl -sI https://your-site/index.xml | head -1
Run those four after every deploy that touches headers, robots, or sitemap config. They take under a second and they catch every breakage I have seen so far.
What this guide deliberately does not cover
There are several emerging signals worth being aware of but not yet worth shipping on this site:
- llms.txt. A proposed convention for a top-level file listing content an LLM should be aware of, with links to canonical Markdown representations. Adoption is mixed; the format is in flux. Worth watching, not yet worth maintaining.
- AI sitemaps / sitemap-llms.xml. Distinct from the standard XML sitemap; carries summaries, semantic tags, and intended consumption modes. Cloudflare and some indexing services are experimenting; no stable standard yet.
- /.well-known/agent-skills/. A folder of agent-discoverable skill descriptors. Useful if the site exposes an API or programmatic surface; misleading on a pure content site.
- JSON-LD per-page beyond what PaperMod emits. Could expand the Article schema with accountablePerson, dateModified, inLanguage, formal license URLs. Worthwhile if you decide to invest in structured data; deferred here pending a decision on whether to do so consistently across the existing archive.
- X-Robots-Tag HTTP headers. Useful when you need per-response indexing directives that are independent of robots.txt. Not currently needed; existing robots.txt declarations cover the same ground.
The volume of “agent-readiness” recommendations will keep growing; the ones that survive the next year are the ones to invest in.
Closing
A reasonable position for a small content site, after a week of running checks: ship the three things in this guide, watch the draft RFCs for the bullets above, revisit in six months. The site that does that is in a better position than the site that scored 100% against a checker that itself will not exist in twelve.