Tested on: Hugo 0.147.0 (extended), Cloudflare Pages (Free plan), 23 May 2026. The standards in this space are moving fast — re-check the IANA Link Relations registry and the IETF Content-Signal draft before copying anything verbatim into a production site.

The phrase “agent-ready” covers a moving target. Twelve months ago it meant “have a sitemap.” Today it touches RFC 8288 Link headers, the IETF Content-Signal draft, IANA-registered link relations, content negotiation for text/markdown, llms.txt, AI sitemaps, and /.well-known/agent-skills/. Most of those are drafts or vendor experiments. Some of them are stable enough to ship now. Others are worth watching but not worth wiring up yet.

This is what I shipped for stackharden.com (a Hugo site on Cloudflare Pages), what I deliberately did not ship, and the reasoning for each call. Where a recommendation is generic and does not fit a content site, I have ignored it rather than padded the implementation with rels pointing at endpoints that don’t exist.

Terms used in this guide

A few abbreviations come up repeatedly. Defined here so the rest of the guide can stay terse.

  • RFC: Request for Comments. The numbered specification documents that define Internet standards (e.g. RFC 8288 defines the HTTP Link header). Published by the IETF.
  • IETF: Internet Engineering Task Force. The standards body that develops and publishes RFCs covering most of the protocols the web runs on.
  • IANA: Internet Assigned Numbers Authority. The registry that maintains, among other things, the official list of HTTP link relation types: the values you can put in a rel="..." parameter and expect a compliant parser to recognise.
  • LLM: Large Language Model. The class of AI system (ChatGPT, Claude, Gemini, etc.) that may consume the site programmatically, either at training time (in bulk) or at query time (per request).
  • RAG: Retrieval-Augmented Generation. The pattern where an LLM, instead of answering only from its training, fetches relevant external content at query time and incorporates it into the answer. RAG-style consumption is what the ai-input content signal speaks to.
  • rel: short for relation type. The rel parameter on a Link header (or HTML <link> element) declares what the linked resource is to the current one (e.g. rel="sitemap", rel="author").
  • RSS: Really Simple Syndication. The XML feed format Hugo publishes at /index.xml, used by readers and aggregators that poll for new content.

Why this matters, briefly

LLMs are increasingly intermediating discovery. A reader looking for “how do I configure NIS2-compliant logging on PostgreSQL” may never reach Google — they ask Claude, ChatGPT, Gemini, or whatever else their organisation has standardised on. If those tools can identify, ingest, and correctly attribute your content, you are in the answer. If they cannot, you are not. The signals that determine which side of the line you land on are increasingly explicit, and they live in HTTP headers and robots.txt rather than in your HTML.

That is the upside. The downside is that “agent-ready” checklists proliferate faster than the standards under them stabilise. Spending an afternoon implementing every recommendation a tester throws at you is a sure way to ship code you’ll be ripping out in a quarter. The question is not “can I score 100%?” — it is “which of these signals are durable enough to be worth maintaining?”

Link headers

What it is. RFC 8288 defines the Link HTTP response header. Each Link value carries a target URL and a rel parameter naming one or more relation types, which declare the relationship between the response and the linked resource. Browsers ignore most of them. Agents and crawlers increasingly use them as a low-cost discovery surface, strictly cheaper than parsing HTML to look for <link> tags in the document head.

What to ship. Cloudflare Pages reads a _headers file at the root of the static output directory (Hugo copies static/_headers to the build output verbatim). The format is one URL pattern per block, with indented header lines underneath. Multiple Link lines on a single pattern get merged into one comma-separated header on the wire — that’s valid per RFC 8288 §3 and parsers handle it identically.

The set I shipped:

/*
  Link: </sitemap.xml>; rel="sitemap"
  Link: </index.xml>; rel="alternate"; type="application/rss+xml"; title="StackHarden RSS feed"
  Link: </about/>; rel="author"
  Link: </privacy/>; rel="privacy-policy"
  Link: </disclaimer/>; rel="license"
  Link: </contribute/>; rel="help"

Six rels, all of them registered with IANA, all of them pointing at URLs that exist on the live site. Critically, no service-doc and no api-catalog. Those relations (service-doc from RFC 8631, api-catalog from RFC 9727) are meant for sites that publish API documentation or an API catalog. This site does neither. Adding them and pointing at /docs or similar would be padding for the checker and a lie to anyone actually following the link. If a recommendation tells you to include a rel and the target does not exist on your site, ignore the recommendation.

How to verify on the wire.

curl -sI https://stackharden.com/ | grep -i '^link:'

You should see one Link: line with all six entries comma-separated, or (depending on the upstream) six separate Link: lines. Either is correct. To break it out cleanly for inspection:

curl -sI https://stackharden.com/ | grep -o 'rel="[^"]*"' | sort -u

That returns one line per distinct rel. Six unique values means the headers are intact.
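
If you want each Link entry on its own line, target and parameters together, a small filter over the raw headers does it. This is a minimal sketch, not a full RFC 8288 parser: split_links is a hypothetical helper name, and it assumes entries are separated by a comma followed by <, which holds for the header set above.

```shell
# split_links: read raw HTTP response headers on stdin and print one
# Link entry per line, whether the server sent one merged header or several.
# Assumption: entries are separated by "," followed by "<".
split_links() {
  tr -d '\r' |
    grep -i '^link:' |
    sed 's/^[^:]*:[[:space:]]*//' |
    awk '{ gsub(/,[ \t]*</, "\n<"); print }'
}
```

Usage would look like `curl -sI https://stackharden.com/ | split_links`. Splitting on the comma-plus-bracket pair rather than on every comma keeps a title such as "Guides, notes" in one piece.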

Gotchas.

  • _headers is evaluated at the Cloudflare edge, not by Hugo. The Hugo dev server (hugo server) does not apply it — verification has to happen against the live deploy.
  • Malformed entries are logged in the Cloudflare Pages build output and silently skipped. The site does not break; the rules just do not apply. Check the deploy log if a curl does not surface what you expected.
  • The /* pattern matches every path including assets. That is intentional here — there is no downside to a .css response carrying these Link headers, and the alternative (only /) means an agent landing on a guide does not see them.

Content-Signal in robots.txt

What it is. A directive in robots.txt declaring the site’s position on three discrete consumption modes:

  • search — search engines building indexes and returning hyperlinks / short excerpts.
  • ai-input — AI systems using the content as live input for retrieval-augmented generation, real-time grounding, AI search summaries.
  • ai-train — using the content to train or fine-tune AI models.

The draft is still in IETF flow, but the major search and AI operators have publicly stated they will honour it. Cloudflare injects a managed version of this directive automatically when “AI Crawl Control” is enabled in the dashboard. The injection is fine but it is opaque — it sits inside # BEGIN Cloudflare Managed content markers and you don’t control it.

I chose to own the declaration in the site’s own robots.txt template. That makes the signal stable regardless of Cloudflare edge-level behaviour, regardless of plan tier, and regardless of whoever later touches the AI Crawl Control settings.

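What to ship. Override the theme’s robots.txt template by creating layouts/robots.txt under the Hugo project root; Hugo looks for it there, and no extension gymnastics are needed. The template is only rendered when enableRobotsTXT is set to true in the site configuration. PaperMod’s upstream template becomes: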

User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
{{- if hugo.IsProduction | or (eq site.Params.env "production") }}
Disallow:
{{- else }}
Disallow: /
{{- end }}

Sitemap: {{ "sitemap.xml" | absURL }}

The position I took:

  • search=yes — we want to be found. Obvious.
  • ai-input=yes — when someone asks an AI assistant about a topic we cover well, we want our content cited in the answer. The cost is loss of direct traffic for those queries; the benefit is wider exposure and citation as the authoritative source. For a content site whose differentiator is the writing, this trades short-term pageviews for long-term authority. The math swings the other way for a site monetised purely on ads or one offering paywalled primary research.
  • ai-train=no — we do not consent to wholesale ingestion for model training. Distinct from ai-input: training is bulk ingestion typically without per-request attribution; ai-input is per-query lookup typically with a citable source.

The three signals are independent. Allowing search but disallowing ai-input is a coherent stance for a paywalled site. Allowing ai-input but disallowing ai-train is the standard stance for a content site that wants discovery but not bulk consumption.

How to verify.

curl -s https://stackharden.com/robots.txt

Content-Signal: should appear on its own line inside the default User-agent: * block, before the Disallow: directive. If Cloudflare’s AI Crawl Control is also enabled, you may see a second, Cloudflare-injected block with # BEGIN Cloudflare Managed content markers carrying its own signals and a list of per-bot Disallow directives. That layering is fine — our explicit ai-train=no is the higher-level statement; Cloudflare’s per-bot list is belt-and-braces for named crawlers.
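
To check the individual values rather than eyeball the file, the directive can be broken into one key=value pair per line. This is a rough sketch under the current draft syntax (comma-separated pairs after Content-Signal:); content_signals is a hypothetical helper name, and it prints pairs from every Content-Signal line it finds, including any Cloudflare-injected one.

```shell
# content_signals: read a robots.txt body on stdin and print each
# Content-Signal pair (for example ai-train=no) on its own line.
# Assumption: pairs are comma-separated, as in the directive shown above.
content_signals() {
  tr -d '\r' |
    grep -i '^content-signal:' |
    sed 's/^[^:]*:[[:space:]]*//' |
    tr ',' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}
```

For example, `curl -s https://stackharden.com/robots.txt | content_signals | grep -x 'ai-train=no'` exits non-zero if the training opt-out has gone missing.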

Gotchas.

  • The IETF draft is not yet an RFC. The exact directive name and syntax may shift before stabilisation. Watch the draft and be prepared to rename Content-Signal if the IETF flow lands on a different keyword.
  • A signal like ai-input=no is a declaration, not enforcement. An LLM operator that ignores the directive can still scrape and serve. The signal is consent and audit evidence, useful if a dispute turns legal, not a hard block. For hard blocks, look at IP-level blocking or Cloudflare’s bot-fight settings. Disallow: / rules for named user agents are a further option, but they are still robots.txt directives and rely on the crawler choosing to comply.

Markdown for Agents

What it is. Content negotiation. When a client sends Accept: text/markdown on a request, the server responds with a Markdown representation of the same content rather than the HTML. Browsers keep getting HTML; agents that want lower-token-cost ingestion get Markdown.

Cloudflare offers this as a managed feature: zone → AI Crawl Control → Markdown for Agents toggle. The conversion happens at the edge, with no origin changes required.

Why I did not ship it.

Cloudflare gates the feature to Pro, Business, and Enterprise plans (~€20/month minimum for Pro). On a Free-plan zone the toggle simply is not present. For a small content site that does not yet have a revenue model, that is a real cost decision, not a trivial upgrade.

The DIY alternative is a Hugo output format (Hugo can emit a .md variant of every page at build time) combined with a Cloudflare Pages Function (functions/_middleware.js) that intercepts requests, checks the Accept header, and rewrites to the .md variant when the client opts in. It works. It is also two surfaces of code that will need maintenance as shortcodes evolve — the markdown output for a page that uses {{< compliance >}} either contains raw Hugo syntax (which is wrong) or requires a parallel markdown renderer for each shortcode (which is real engineering).

For a content site whose audience is human practitioners reading HTML in a browser, the value of content-negotiated markdown is narrow. An LLM crawling the site for retrieval will parse the HTML fine — <pre><code> blocks, headings, lists, and the Link headers we just shipped are enough. The marginal benefit of serving markdown is reduced token cost at ingestion time, which the crawler bears, not us.

The honest answer: skipped, because the value-to-effort ratio is low. If you are on Pro+ already, flip the toggle and verify with:

curl -sI -H 'Accept: text/markdown' https://example.com/some/page/ \
  | grep -iE 'content-type|x-markdown'

A Content-Type: text/markdown confirms it landed. If you are on Free, this is the most defensible omission on the list.
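
For a scripted check, the media type can be pulled out of the response headers and compared directly. A minimal sketch: negotiated_type is a hypothetical helper name, and it only looks at the first Content-Type header it sees.

```shell
# negotiated_type: read raw HTTP response headers on stdin and print the
# media type from Content-Type, lowercased, with parameters such as
# charset removed.
negotiated_type() {
  tr -d '\r' |
    grep -i '^content-type:' |
    head -n 1 |
    sed 's/^[^:]*:[[:space:]]*//; s/[[:space:]]*;.*$//' |
    tr '[:upper:]' '[:lower:]'
}
```

Piping the curl command above into negotiated_type should print text/markdown when negotiation is active and text/html when the request fell back to the default representation.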

Verification commands, all in one place

For the working set above, the smoke-test sequence is:

# Link headers: six rels expected
curl -sI https://your-site/ | grep -o 'rel="[^"]*"' | sort -u

# Content-Signal: single directive in default user-agent block
curl -s https://your-site/robots.txt | grep -i '^content-signal:'

# Sitemap: referenced from robots.txt and reachable
curl -sI https://your-site/sitemap.xml | head -1

# RSS feed: alternate representation
curl -sI https://your-site/index.xml | head -1

Run those four after every deploy that touches headers, robots, or sitemap config. They take under a second and they catch every breakage I have seen so far.
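
If you would rather have a single pass/fail result, for instance as a post-deploy step, the four checks can be wrapped in one function. This is a sketch under assumptions: smoke_test is a hypothetical name, the expected rel count defaults to the six shipped here, and the paths are the ones used in this guide.

```shell
# smoke_test BASE_URL [EXPECTED_RELS]
# Runs the four checks and returns non-zero if any of them fails.
smoke_test() {
  base=$1
  want=${2:-6}
  fail=0

  # 1. Link headers: count distinct rel values on the home page.
  rels=$(curl -sI "$base/" | grep -i '^link:' | grep -o 'rel="[^"]*"' |
    sort -u | wc -l | tr -d ' ')
  [ "$rels" -eq "$want" ] || { echo "FAIL: expected $want rels, got $rels"; fail=1; }

  # 2. Content-Signal: directive present in robots.txt.
  curl -s "$base/robots.txt" | grep -i '^content-signal:' >/dev/null ||
    { echo "FAIL: no Content-Signal line in robots.txt"; fail=1; }

  # 3 and 4. Sitemap and RSS feed both answer with HTTP 200.
  for path in /sitemap.xml /index.xml; do
    status=$(curl -sI -o /dev/null -w '%{http_code}' "$base$path")
    [ "$status" = 200 ] || { echo "FAIL: $path returned $status"; fail=1; }
  done

  return "$fail"
}
```

Calling `smoke_test https://your-site` prints one FAIL line per broken check and leaves the result in the exit status, so it can gate a deploy script.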

What this guide deliberately does not cover

There are several emerging signals worth being aware of but not yet worth shipping on this site:

  • llms.txt. A proposed convention for a top-level file listing content an LLM should be aware of, with links to canonical Markdown representations. Adoption is mixed; the format is in flux. Worth watching, not yet worth maintaining.
  • AI sitemaps / sitemap-llms.xml. Distinct from the standard XML sitemap; carries summaries, semantic tags, and intended consumption modes. Cloudflare and some indexing services are experimenting; no stable standard yet.
  • /.well-known/agent-skills/. A folder of agent-discoverable skill descriptors. Useful if the site exposes an API or programmatic surface; misleading on a pure content site.
  • JSON-LD per-page beyond what PaperMod emits. Could expand the Article schema with accountablePerson, dateModified, inLanguage, formal license URLs. Worthwhile if you decide to invest in structured data; deferred here pending a decision on whether to do so consistently across the existing archive.
  • X-Robots-Tag HTTP headers. Useful when you need per-response indexing directives that are independent of robots.txt. Not currently needed; existing robots.txt declarations cover the same ground.

The volume of “agent-readiness” recommendations will keep growing; the ones that survive the next year are the ones to invest in.

Closing

A reasonable position for a small content site, after a week of running checks: ship what this guide ships (the Link headers and the Content-Signal directive), watch the drafts and proposals behind the bullets above, and revisit in six months. The site that does that is in a better position than the site that scored 100% against a checker that itself will not exist in twelve months.