
Robots.txt Checker

Audit your robots.txt against known AI crawlers and check for blocked access.


Robots.txt Checker audits your robots.txt against the AI crawlers SurfacedBy tracks and shows which bots are allowed, which are restricted, and which are blocked entirely. It also checks response headers and your llms.txt file.

The page lives at Dashboard -> Tools -> Robots.txt Checker (/dashboard/{workspace-slug}/{domain-id}/robots).

Why this matters

AI platforms use named crawlers to fetch your pages for training, search indexing, or live retrieval. Blocking a crawler opts you out of whatever that crawler feeds, so the practical impact varies by bot. The most prominent examples:

  • GPTBot trains OpenAI's foundation models. Blocking it means your content does not contribute to future model training.
  • OAI-SearchBot indexes the web for ChatGPT search. Blocking it means ChatGPT will not cite your pages in answers.
  • ClaudeBot trains Anthropic's models. The retrieval counterpart is Claude-User, which fetches a page when a user pastes a link into Claude.
  • PerplexityBot indexes the web for Perplexity's answer engine. Blocking it means Perplexity will not cite you.
  • Google-Extended controls whether Google may use your pages to train and ground Gemini. Blocking it leaves classic Google Search intact but opts you out of Google AI training.

The per-bot table on the result page covers every crawler SurfacedBy tracks (currently 15+ bots from OpenAI, Anthropic, Google, Microsoft, Perplexity, Meta, Apple, ByteDance, Manus, and Common Crawl).

Blocking a retrieval bot has an immediate effect on whether you appear in answers. Blocking a search-indexing bot has a slower effect as the index refreshes. Blocking a training bot has a structural effect on whether future models learn from you.
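Because each crawler is addressed by name, a robots.txt file can opt out of one purpose while staying open to another. The sketch below illustrates this with Python's standard-library urllib.robotparser. The robots.txt content, the example.com URLs, and the can_crawl helper are illustrative assumptions, and the standard-library matcher may not behave exactly like any particular crawler or like SurfacedBy's own check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of OpenAI training (GPTBot) while
# staying visible to ChatGPT search (OAI-SearchBot).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"
    for bot in ("GPTBot", "OAI-SearchBot", "PerplexityBot"):
        print(bot, can_crawl(ROBOTS_TXT, bot, page))
```

In this example PerplexityBot has no group of its own, so it falls under the catch-all group and is only kept out of /private/.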

How to run a check

  1. Open the page. The domain is pre-filled from the current tracked domain.
  2. Click "Re-check" to fetch a fresh robots.txt. The result is a live fetch, not a cached value.
  3. The status hero, per-bot table, header checks, and llms.txt card update with the new result.

What the result shows

Status hero. A single chip summarizing the result: all bots allowed, some bots restricted, or AI bots blocked entirely. The fetched robots.txt content appears below the chip in a copyable code block.

Per-bot access table. One row per known AI bot, sorted into purpose groups (training, retrieval, multi-purpose). Each row shows the bot name, the status (Allowed, Restricted, or Blocked), the matching rule from your robots.txt that produced the status, and the evidence (the specific Disallow line, the User-agent block, or the catch-all rule).
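To make the three statuses concrete, here is a minimal sketch of how a row's status and evidence could be derived from robots.txt text. It is not SurfacedBy's implementation. The classify and parse_groups helpers, the EXAMPLE file, and the status rules are assumptions made for illustration: "Blocked" means the bot's group contains "Disallow: /" with no Allow rules, "Restricted" means any other non-empty Disallow, and only the first matching group is considered.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    agents: list = field(default_factory=list)  # lowercased User-agent values
    rules: list = field(default_factory=list)   # (directive, path) tuples


def parse_groups(robots_txt: str) -> list:
    """Split robots.txt text into User-agent groups with their Allow/Disallow rules."""
    groups = []
    current = None
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share a group; one that follows
            # rules starts a new group.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.agents.append(value.lower())
        elif key in ("allow", "disallow") and current is not None:
            current.rules.append((key, value))
    return groups


def classify(robots_txt: str, bot: str):
    """Return (status, evidence lines) for one bot under the assumed rules."""
    groups = parse_groups(robots_txt)
    name = bot.lower()
    # Prefer the bot's own group, then fall back to the catch-all group.
    group = next((g for g in groups if name in g.agents), None)
    if group is None:
        group = next((g for g in groups if "*" in g.agents), None)
    if group is None:
        return "Allowed", []
    disallows = [p for d, p in group.rules if d == "disallow" and p]
    allows = [p for d, p in group.rules if d == "allow" and p]
    evidence = [f"{d.capitalize()}: {p}" for d, p in group.rules if p]
    if not disallows:
        return "Allowed", evidence
    if "/" in disallows and not allows:
        return "Blocked", evidence
    return "Restricted", evidence


# Hypothetical robots.txt used only for this example.
EXAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /drafts/

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for bot in ("GPTBot", "ClaudeBot", "PerplexityBot"):
        print(bot, *classify(EXAMPLE, bot))
```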

Header checks. The X-Robots-Tag and Cache-Control response headers and the meta robots tag observed on your homepage and on a sample inner page. These page-level directives apply on top of robots.txt allow rules, so a page can be allowed in robots.txt and still be excluded by an X-Robots-Tag directive such as noindex.
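As an illustration of the X-Robots-Tag part of this check, the sketch below decides whether a set of header values carries a blocking directive for a given bot. The blocks_indexing helper is hypothetical. It assumes the commonly documented syntax in which a value is either a comma-separated directive list or a directive list prefixed with a user agent name ("botname: noindex"), and it treats noindex and none as blocking. Whether a particular AI crawler honors bot-scoped values is not something this sketch can confirm.

```python
# Directives treated as blocking in this sketch (an assumption).
BLOCKING = {"noindex", "none"}

# Directives that take a value after a colon, so the colon does not
# indicate a user-agent prefix.
VALUED = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def blocks_indexing(x_robots_values: list, bot: str) -> bool:
    """Return True if any X-Robots-Tag value applies a blocking directive to bot."""
    for value in x_robots_values:
        head, sep, tail = value.partition(":")
        head_name = head.strip().lower()
        # A value is bot-scoped when the text before the first colon is a
        # single token that is not a valued directive.
        scoped = bool(sep) and "," not in head and head_name not in VALUED
        if scoped:
            if head_name != bot.lower():
                continue  # scoped to a different bot
            directives = tail
        else:
            directives = value
        tokens = {token.strip().lower() for token in directives.split(",")}
        if tokens & BLOCKING:
            return True
    return False


if __name__ == "__main__":
    print(blocks_indexing(["noindex, nofollow"], "GPTBot"))
    print(blocks_indexing(["googlebot: noindex"], "GPTBot"))
```

The header values would come from a real HTTP response (for example, every X-Robots-Tag line returned for the homepage); the function itself does no network access.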

llms.txt check. Whether your site serves an llms.txt file at the root and whether the file conforms to the published format. On Pro and above, a "Generate llms.txt" action produces a recommended file based on the content SurfacedBy has already discovered on your site.
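A rough sketch of what a format check could look like, assuming the structure described in the public llms.txt proposal: an H1 title as the first content, followed by optional H2 sections whose list items are markdown links with optional notes. The check_llms_txt helper and both sample files are hypothetical, and SurfacedBy's actual conformance rules may be stricter or different.

```python
import re

# "- [name](url)" optionally followed by ": notes" (assumed list-item shape).
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\([^)\s]+\)(: .*)?$")


def check_llms_txt(text: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    lines = [line.rstrip() for line in text.splitlines()]
    non_blank = [line for line in lines if line.strip()]
    if not non_blank or not non_blank[0].startswith("# "):
        problems.append("first non-blank line should be an H1 title")
    in_section = False
    for line in lines:
        if line.startswith("## "):
            in_section = True
        elif in_section and line.startswith("- ") and not LINK_ITEM.match(line):
            problems.append(f"list item is not a markdown link: {line}")
    return problems


# Hypothetical sample files used only for this example.
GOOD = """\
# Example Site

> Short summary of the site.

## Docs

- [Getting started](https://example.com/start.md): Setup guide
- [API](https://example.com/api.md)
"""

BAD = """\
Example Site

## Docs

- plain item without a link
"""

if __name__ == "__main__":
    print(check_llms_txt(GOOD))
    print(check_llms_txt(BAD))
```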

Recommended fixes. When the check finds blocks you did not intend, the page lists recommended changes with copy-paste snippets you can drop into your robots.txt.

Re-check rate limits

The re-check action is rate-limited. The Free tier allows fewer re-checks than Starter, Pro, and Business. The limit keeps repeated fetches against the same domain from interfering with your actual traffic.

Notes on the result

Every check is a fresh, uncached fetch of your robots.txt. If you change your robots.txt and click "Re-check", the result reflects the change as soon as your origin serves the new file. The history of checks is stored so you can see when access changed.
