Your Site Is Invisible to AI Agents (and Your CMS Can Fix It)

AI agents read your site as token-heavy HTML and cite it badly. Two CMS features fix it: an editor-owned llms.txt generated by AI, and per-page Markdown served to agents on the same URL.

[Author]
Edoardo Lunardi
An AI agent lands on your client's homepage to answer a question about their product. It pulls the page, and the first 8,000 tokens it reads are a cookie banner, a navigation menu, three analytics scripts, and the class names on a hero section. The actual sentence it needed, the one describing what the company does, sits somewhere past all of that, if the agent's context window reached it at all. The agent gives a vague answer, cites someone else, and moves on.
You built that page for a browser. A browser renders the markup and throws the rest away. An agent has to read the markup, and every tag and script costs it tokens it could have spent understanding the content. The same page as clean Markdown can run up to 97% smaller. That gap is the difference between an agent quoting your client accurately and an agent guessing.
I have shipped this layer on production Next.js and Sanity builds, and the surprise every time is how little of it is hard once the boundary is drawn right. Two features cover it. First, an editor-owned /llms.txt index, drafted by AI from the site's own content in one click, tells an agent what the site contains. Second, any public page returns a token-light Markdown version of itself on its own URL when an agent asks. Discovery, then retrieval. It is about to be the part clients ask for by name.

Why agents read your site differently

A browser and an agent want opposite things from the same URL. The browser wants HTML it can paint: layout, fonts, interactivity, the whole rendered document. The agent wants the content and nothing else, because everything else is tokens spent on noise it has to parse before it reaches a single useful sentence.
Two problems sit between an agent and your content. Discovery: before an agent reads a page, it has to know the page exists, and a large site does not fit in a context window for it to crawl around. Retrieval: once it knows a page exists, reading it as HTML is expensive and noisy. Solve one without the other and the agent either cannot find your pages or chokes on them when it does.
The fix is a layered one, and it is not new thinking. Sanity's own field guide on serving content to agents frames the exact model: an llms.txt index for discovery, plus per-page Markdown through content negotiation for retrieval. Vercel's write-up on agent-friendly pages lands on the same approach. The honest caveat: the research does not settle whether any of this lifts AI citations. Profound ran a randomized test across 381 pages and found no significant change in bot traffic from Markdown. This is early and experimental, and nobody can promise you rankings from it. The case for doing it anyway is that it costs almost nothing. Agents are already fetching your pages, the work is a toggle and a serializer, and clean content is the right thing to serve whether or not the citation upside ever materializes. Low cost, real upside if the bet pays, no downside if it does not.

Discovery: an llms.txt an editor owns

An /llms.txt is a single Markdown file at a known URL that hands an agent a concise map of your site with links to its pages. It follows the llmstxt.org convention that Jeremy Howard proposed in September 2024, now adopted by sites like Next.js, Cloudflare, and Hugging Face, with Anthropic publishing its own as a lightweight sitemap. It is an index of links with short descriptions, not a dump of your content. The whole file is tiny, often under a thousand tokens. Its job is to orient an agent at the moment its context window cannot hold the whole site.
A served file reads like this:
# Index

> Freelance creative frontend engineer based in Vienna, working
> worldwide on animation-heavy, interaction-rich sites on headless
> CMS stacks (Next.js, TypeScript, Sanity, Tailwind, Motion,
> Vercel). Shipped work for Buck, Disney, Porsche, Red Bull, and
> Getty. Awwwards juror and Codrops contributor.

## Core pages

- [Index](https://www.edoardolunardi.dev/): Role, stack, selected
  work, and the primary way to get in touch.
- [Works](https://www.edoardolunardi.dev/works): Portfolio index of
  production client projects.
- [Lab](https://www.edoardolunardi.dev/lab): Interaction experiments
  in motion, scroll, and novel UI.
- [Blog](https://www.edoardolunardi.dev/blog): Writing, including
  deep dives on content architecture.

## Articles

- [The Content Architecture I: CMS Structure](https://www.edoardolunardi.dev/blog/the-content-architecture-cms-structure):
  How to structure a CMS for long-term maintainability.
- [The Content Architecture II: Content Models](https://www.edoardolunardi.dev/blog/the-content-architecture-content-models):
  Designing schemas that match real content.
- [The Content Architecture III: Page Composition](https://www.edoardolunardi.dev/blog/the-content-architecture-page-composition):
  Assembling pages from reusable sections.
- [The Content Architecture IV: Content Primitives](https://www.edoardolunardi.dev/blog/the-content-architecture-content-primitives):
  Low-level building blocks for flexible pages.

## Optional

- For a quick hiring decision: Index, Works, a recent case study,
  and the Content Architecture series. To start a project, use the
  contact path on the Index page.
Most guides to building this file land in one of two places, and both have a cost. You hardcode a static file in public/, and it drifts the moment anyone publishes, because nothing updates it. Or you write a dynamic route handler that queries your CMS, which stays accurate but lives in code, so every change to what the file emphasizes means a developer and a deploy. An editor who wants to reword the summary or reorder which pages matter has to file a ticket.
The version I ship moves the file into the CMS. The content lives in a field on the Site document, so an editor owns it and edits it like any other field. It carries a Generate button. One click drafts the entire file from the site's own content using Sanity Agent Actions, Sanity's AI capability for generating content from a prompt plus structured inputs. The editor reviews the draft, edits anything, and publishes. No deploy.
The generation is grounded, which is what makes it trustworthy. The route reads an inventory of every indexable page from the CMS, the same visibility rules the sitemap uses, so nothing private leaks in. It builds one entry per page with the real title and a URL constructed in code, then passes that to the model as structured data. The model writes the descriptions but never touches a URL, because the URLs never pass through it. An agent gets links that resolve, every time, with descriptions an editor approved.
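A minimal sketch of that grounding, assuming a page inventory has already been read from the CMS. The names here (PageRecord, buildEntries, renderLlmsTxt, the group field) are hypothetical and not taken from the starter or from Sanity. The point is only the shape: URLs are assembled in code, and the model's output is reduced to a map of descriptions keyed by page id.

```typescript
// Hypothetical shapes for illustration; none of these names come from Sanity.
interface PageRecord {
  id: string;
  title: string;
  path: string; // site-relative, e.g. "/works"
  group: string; // heading the entry is listed under, e.g. "Core pages"
}

interface IndexEntry {
  id: string;
  title: string;
  url: string;
  group: string;
}

// URLs are assembled here, in code, and are never taken from model output.
function buildEntries(pages: PageRecord[], baseUrl: string): IndexEntry[] {
  const origin = baseUrl.replace(/\/+$/, "");
  return pages.map((page) => ({
    id: page.id,
    title: page.title,
    url: origin + (page.path.charAt(0) === "/" ? page.path : "/" + page.path),
    group: page.group,
  }));
}

// The model's only contribution is `descriptions`, keyed by page id.
// Ids that match no entry are ignored; a missing description leaves a bare link.
function renderLlmsTxt(
  title: string,
  summary: string,
  entries: IndexEntry[],
  descriptions: Record<string, string | undefined>,
): string {
  const lines: string[] = [`# ${title}`, "", `> ${summary}`];
  const groups: string[] = [];
  for (const entry of entries) {
    if (groups.indexOf(entry.group) === -1) groups.push(entry.group);
  }
  for (const group of groups) {
    lines.push("", `## ${group}`, "");
    for (const entry of entries) {
      if (entry.group !== group) continue;
      const description = (descriptions[entry.id] ?? "").trim();
      const link = `- [${entry.title}](${entry.url})`;
      lines.push(description ? `${link}: ${description}` : link);
    }
  }
  return lines.join("\n") + "\n";
}
```

Because the renderer looks descriptions up by id, a model that invents an extra page or rewrites a link has nowhere to put it: the entry list is fixed before the model is called.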
There is a deliberate split underneath. Generation reads your content with drafts overlaid, so the draft reflects work in progress. Serving only ever returns the published field. A draft never appears at /llms.txt. And the whole file is gated by a single toggle: turn Serve /llms.txt off and the URL returns 404 without deleting a word of the content.
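The serving side of that split can be sketched as a pure function, assuming the published Site document has already been fetched. The field names (serveLlmsTxt, llmsTxt) and the SimpleResponse type are assumptions for illustration; in a Next.js route handler the result would map onto a real Response object.

```typescript
// Hypothetical projection of the *published* Site document.
interface PublishedSite {
  serveLlmsTxt: boolean; // the editor-facing toggle
  llmsTxt: string | null; // the editor-owned field
}

// Plain object standing in for an HTTP response, to keep the sketch self-contained.
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

function llmsTxtResponse(site: PublishedSite | null): SimpleResponse {
  // Toggle off, no published document, or empty field: the URL does not exist.
  // The stored content is untouched; only serving is gated.
  if (!site || !site.serveLlmsTxt || !site.llmsTxt) {
    return { status: 404, headers: {}, body: "Not found" };
  }
  return {
    status: 200,
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
    body: site.llmsTxt,
  };
}
```

The draft never reaches this function at all: it only ever receives the published document, so there is no draft state to leak.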
In the Studio it is one tab and two switches. An editor opens the Site document, clicks Generate, reviews the draft, and flips the toggle to serve it. The same goes for any page: one switch in the SEO tab turns it into Markdown for agents.

Retrieval: Markdown on the same URL

The second feature is the larger one. When an agent fetches a page and its request says it prefers Markdown, the server returns clean Markdown built from the page's content instead of the full HTML document, on the same URL. A browser, which asks for HTML, still gets the HTML page. Same address, two representations, decided by who is asking.
The signal is the HTTP Accept header. A browser sends Accept: text/html,... and never lists Markdown, so it always gets HTML. An agent that sends Accept: text/markdown gets the Markdown. This is content negotiation, a part of HTTP that has existed for decades, used here to serve agents without giving them a separate set of URLs to discover. The /llms.txt index links your normal page URLs, and those same URLs answer in Markdown when an agent asks. The two features lock together with no extra wiring.
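As a rough sketch, the header check can be a small pure function. The name prefersMarkdown is hypothetical, and this is a simplification of full content negotiation: it treats any text/markdown entry with a non-zero q value as a request for Markdown and does not rank it against other types in the header.

```typescript
// Returns true when the Accept header lists text/markdown as acceptable.
// Simplified: no ranking against other media types, wildcards do not count.
function prefersMarkdown(accept: string | null | undefined): boolean {
  if (!accept) return false;
  return accept.split(",").some((range) => {
    // "text/markdown;q=0.9" -> ["text/markdown", "q=0.9"]
    const parts = range.split(";").map((part) => part.trim().toLowerCase());
    if (parts[0] !== "text/markdown") return false;
    // q=0 means "explicitly not acceptable"; no q parameter means q=1.
    const q = parts.slice(1).find((param) => param.indexOf("q=") === 0);
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}
```

A typical browser header such as text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 returns false here, because the wildcard is deliberately not treated as a request for Markdown.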
Every public page gets one new switch in its SEO tab, Serve Markdown to agents, on by default. Turn it off and that page stays HTML-only for agents. An editor decides, per page, without touching code. This is the control a bare llms.txt cannot give you: a single file is all-or-nothing, with no way to hold an individual page back. A per-page switch is per-page governance.
The property that makes this safe to ship is what it costs normal traffic. Nothing. A browser's Accept header never contains "markdown," so the check that detects an agent is one substring test that fails immediately, and the request continues down the normal HTML path with no further work. Every reader who is not an agent pays nothing. Only the rare agent request does the eligibility read, and that read is cached and busted by webhook, so even agents barely touch the database.
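One way to picture that cost profile, as a sketch under assumptions: an in-memory cache stands in for whatever cache the real build uses, the loader stands in for the CMS read, and invalidate stands in for the webhook. EligibilityCache and representationFor are hypothetical names.

```typescript
type EligibilityLoader = (path: string) => Promise<boolean>;

// In-memory stand-in for the cached eligibility read.
class EligibilityCache {
  private readonly entries = new Map<string, Promise<boolean>>();
  private readonly load: EligibilityLoader;

  constructor(load: EligibilityLoader) {
    this.load = load;
  }

  get(path: string): Promise<boolean> {
    let entry = this.entries.get(path);
    if (!entry) {
      const fresh = this.load(path);
      this.entries.set(path, fresh);
      // Do not keep a failed read around; the next request retries.
      fresh.catch(() => {
        if (this.entries.get(path) === fresh) this.entries.delete(path);
      });
      entry = fresh;
    }
    return entry;
  }

  // Would be called from a CMS webhook handler after a publish (assumption).
  invalidate(path: string): void {
    this.entries.delete(path);
  }
}

async function representationFor(
  accept: string | null,
  path: string,
  cache: EligibilityCache,
): Promise<"html" | "markdown"> {
  // Browser traffic exits here: one substring test, no cache or CMS access.
  if (!accept || accept.toLowerCase().indexOf("markdown") === -1) return "html";
  // Only agent requests reach the (cached) per-page eligibility read.
  return (await cache.get(path)) ? "markdown" : "html";
}
```

The ordering is the design choice: the cheap test runs before anything that could touch the cache or the CMS, so the number of eligibility reads scales with agent traffic and publishes, not with page views.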

The part that took three tries

Turning a CMS page into Markdown sounds like a formatting problem. It is an architecture problem, and the difference is whether the feature survives the next section type someone adds.
The naive version is one branch of code per section type. It works until someone builds a new section, at which point it silently renders nothing, and nobody notices until an agent reads an empty page. Every section type is a new place for the serializer to fall behind the schema.
The version that holds reads by convention. The schema is built from reusable field factories that always emit the same field names: a rich-text factory always produces appRichText, a media factory always produces appMedia, and a link factory always produces appLink. Those names are guaranteed on any section built from the factories, so the serializer reads them wherever they appear and renders each to Markdown, with no per-section branching. A new section serializes the day it is created.
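A reduced sketch of reading by convention, not the starter's actual serializer. Only the three field names come from the text above; the value shapes are assumptions: appRichText is treated as a small subset of Portable Text (blocks with a style and text children), and appMedia and appLink are given hypothetical url/alt and href/label fields.

```typescript
type Renderer = (value: unknown) => string[];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumed shape: [{ style: "h2" | "normal", children: [{ text }] }]
const renderRichText: Renderer = (value) => {
  if (!Array.isArray(value)) return [];
  const out: string[] = [];
  for (const block of value) {
    if (!isRecord(block)) continue;
    const children = block.children;
    if (!Array.isArray(children)) continue;
    const text = children
      .map((child) => (isRecord(child) && typeof child.text === "string" ? child.text : ""))
      .join("")
      .trim();
    if (!text) continue;
    const style = block.style;
    const heading = typeof style === "string" ? /^h([1-6])$/.exec(style) : null;
    out.push(heading ? `${"#".repeat(Number(heading[1]))} ${text}` : text);
  }
  return out;
};

// Assumed shape: { url, alt? }
const renderMedia: Renderer = (value) => {
  if (!isRecord(value) || typeof value.url !== "string") return [];
  const alt = typeof value.alt === "string" ? value.alt : "";
  return [`![${alt}](${value.url})`];
};

// Assumed shape: { label, href }
const renderLink: Renderer = (value) => {
  if (!isRecord(value) || typeof value.href !== "string") return [];
  const label = typeof value.label === "string" ? value.label : value.href;
  return [`[${label}](${value.href})`];
};

// The only coupling to the schema: the factory-guaranteed field names.
const renderers = new Map<string, Renderer>([
  ["appRichText", renderRichText],
  ["appMedia", renderMedia],
  ["appLink", renderLink],
]);

// Walks any section shape and renders known fields wherever they appear.
function collect(node: unknown): string[] {
  const out: string[] = [];
  if (Array.isArray(node)) {
    for (const item of node) out.push(...collect(item));
  } else if (isRecord(node)) {
    for (const key of Object.keys(node)) {
      const render = renderers.get(key);
      out.push(...(render ? render(node[key]) : collect(node[key])));
    }
  }
  return out;
}

function sectionsToMarkdown(sections: unknown[]): string {
  return collect(sections).join("\n\n");
}
```

Nothing in the walker names a section type. A section the serializer has never seen, say a pricing table with an intro and a call to action, still renders as long as its fields came from the factories.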
That is the line most DIY versions never reach. They couple the converter to a fixed list of section types, and it rots as the schema grows. Anchoring to factory-guaranteed names handles the common case for free, leaving the rare project-specific field as a single labeled extension. The honest ceiling: the query layer cannot be generated from a config object, because the type generator needs static query strings. So the mapping is one small, well-marked edit point rather than a pretense that the coupling vanishes. That is the difference between a feature you ship and a demo that breaks when the schema changes.

How the two features combine

An agent finds your client's site through llms.txt. It reads the index, a map of real page URLs with descriptions an editor approved. It picks the page it needs and fetches that same URL with Accept: text/markdown. The page answers in lean Markdown, the body and nothing else, with internal links resolved to absolute URLs so the agent can follow them deeper. The same URLs serve both halves, with no special infrastructure between them. For agents that read the index but do not negotiate Markdown on their own, the generated index includes one line noting that every page is also available as Markdown through the Accept header.
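Resolving internal links to absolute URLs can be sketched with the standard URL constructor. The function name absolutizeLinks is hypothetical, and the regular expression is a deliberate simplification: it handles plain inline Markdown links and images, not reference-style links or link text containing nested brackets.

```typescript
// Rewrites the target of inline Markdown links and images to an absolute URL,
// resolved against the URL of the page being served.
function absolutizeLinks(markdown: string, pageUrl: string): string {
  return markdown.replace(
    /(!?\[[^\]]*\]\()([^)\s]+)/g,
    (match: string, prefix: string, href: string) => {
      try {
        // Already-absolute targets pass through unchanged.
        return prefix + new URL(href, pageUrl).toString();
      } catch {
        // Leave anything the URL parser rejects exactly as written.
        return match;
      }
    },
  );
}
```

Relative targets resolve the way a browser would resolve them on that page, so a link written as /works on a blog post comes out with the site's full origin in front of it.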
This is what AI-ready means underneath the marketing word. Not a chatbot bolted onto the site, but a clean map and clean content on the URLs agents already have. When a server answers with Content-Type: text/markdown, Claude Code skips its summarization step and feeds your content to the model verbatim, a fast path you earn by serving structured Markdown instead of letting the agent reverse-engineer it from HTML.

Why this is a competitive line now, not later

Every site your clients run is already being read by agents, and most are serving them the same token-heavy HTML they serve a browser. The agent spends most of its budget on noise, and the answer it gives about your client is worse for it. That cost is invisible right now because nobody is measuring which sites agents cite well. It will not stay that way.
Serving content to agents is becoming part of the brief the way responsive design and Core Web Vitals did. The studios that treat it as a first-class layer, owned by editors, drafted by AI, correct after launch, will quote it as a differentiator while everyone else is still pasting a static file into public/ and watching it drift. The work is not large. The boundary is what is hard, and the boundary is what most implementations get wrong.
This is the kind of layer that gets rebuilt from scratch on every project because it never makes it into an estimate. In The Content Architecture, it is already decided and already shipped: the llms.txt generator wired to Agent Actions, the per-page Markdown toggle, the factory-guaranteed serializer, the proxy that serves agents without taxing browsers. A client build starts with the agentic layer done, past the part that usually costs the first few days and gets cut for time anyway.
The price increase lands July 16. The LAUNCH code is good until then.

Common questions

What is llms.txt and do I need it?

An llms.txt is a Markdown file at the root of your site that gives AI agents a concise, linked index of your pages, so they can find your content without crawling the whole site. It follows a convention Jeremy Howard proposed in September 2024, now used by sites like Next.js, Cloudflare, and Hugging Face. Whether it lifts AI citations is still unproven and experimental, but it costs almost nothing to add, so the practical answer is that it is worth doing as a low-cost discovery layer while the evidence catches up.

How do I generate llms.txt from a CMS instead of hardcoding it?

Query your CMS for every indexable page, build the entries in code with real URLs, and serve the result from a route handler so the file reflects published content automatically. A static file in public/ drifts the moment anyone publishes. The stronger version stores the file in the CMS as an editor-owned field and drafts it with an AI action, so editors reword and reorder it without a deploy while the URLs stay grounded in code.

How do I serve Markdown to AI agents from a Next.js site?

Detect the Accept: text/markdown header on incoming requests and return a Markdown version of the page on the same URL, while browsers asking for text/html get the normal page. This is HTTP content negotiation. Put the detection in the proxy so it can short-circuit browser traffic for free and consult per-page state, like whether the page is private or toggled off, before serving.
