Robots.txt is a plain text file placed in a website's root directory that implements the Robots Exclusion Protocol (REP), instructing web crawlers which portions of the site they may access. It serves as a gatekeeper for automated agents ranging from search engine spiders to AI data scrapers. For SEO practitioners, it controls crawl budget allocation and prevents server overload, and it has recently become the primary defense against unauthorized AI training data collection.
What is Robots.txt
The Robots Exclusion Protocol emerged in 1994, when Martijn Koster proposed a convention allowing site operators to specify which pages or servers web robots should avoid. The standard was formally published in September 2022 as RFC 9309, codifying the technical specification after nearly three decades of de facto use.
A site owner places the file at the root of the host (e.g., https://www.example.com/robots.txt). The file covers only one origin, meaning each subdomain, protocol (HTTP vs HTTPS), and port requires its own robots.txt file. If the file is absent, crawlers assume no restrictions exist.
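Because the file is scoped per origin, a crawler derives the robots.txt location from the scheme, host, and port of the URL it wants to fetch. A minimal sketch using Python's standard library (the page URL is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin serving page_url."""
    parts = urlsplit(page_url)
    # Scheme, host, and port together define the origin;
    # the original path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com:8443/products/item?id=7"))
# https://shop.example.com:8443/robots.txt
```

Note that `https://example.com` and `https://www.example.com` are different hosts and therefore need separate files.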
Why Robots.txt matters
- Manage crawl budget. Direct search engines toward high-value content and away from duplicate or thin pages to ensure efficient indexing of priority URLs.
- Conserve server resources. Reduce unnecessary HTTP requests from bots that might otherwise overwhelm server capacity, particularly important for large sites or during traffic spikes.
- Block AI training data collection. In 2023, research found 306 of the thousand most-visited websites blocked OpenAI's GPTBot, making robots.txt the frontline defense against unauthorized generative AI scraping.
- Prevent indexing of non-public areas. Keep staging sites, internal search results, and administrative panels out of search engine results (though this is advisory; malicious bots may still access these paths).
- Reference sitemaps. Point crawlers to XML sitemaps to ensure discovery of important content without relying solely on internal linking.
How Robots.txt works
When a bot arrives at a site, it requests the robots.txt file before crawling any other content. The bot parses the instructions to determine which URLs it may fetch.
- Identify the user-agent. The bot checks for rules targeting its specific user-agent string (e.g., "Googlebot" or "Bingbot") before falling back to the wildcard entry (`*`).
- Parse directives. The file contains directives such as `Disallow` (block access), `Allow` (override a disallow), `Crawl-delay` (suggest timing between requests), and `Sitemap` (indicate sitemap location).
- Apply pattern matching. The wildcard `*` and end-of-line anchor `$` allow pattern matching for flexible path blocking, such as blocking all URLs containing specific parameters.
- Respect scope limitations. The protocol relies on voluntary compliance. The Internet Archive announced in 2017 that it would stop complying with robots.txt directives, illustrating that archival and some AI bots may ignore restrictions.
Note that Crawl-delay interpretation varies by crawler. Yandex interprets it as seconds between visits, while Bing treats it as a time window during which the bot accesses the site only once. Google ignores this directive, offering crawl rate controls through Search Console instead.
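The user-agent lookup order described above can be demonstrated with Python's standard-library parser; the rules and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body the way a compliant crawler would.
rules = """
User-agent: GPTBot
Disallow: /blog/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot matches its own rule block; other bots fall back to the wildcard.
print(parser.can_fetch("GPTBot", "https://www.example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))  # True
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/"))     # False
```

Keep in mind that `urllib.robotparser` implements the original, simpler protocol; its handling of wildcards and rule precedence differs from Google's parser.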
Best practices
- Place the file correctly. Upload robots.txt to the root directory with exact lowercase filename (robots.txt, not Robots.TXT) and ensure it returns a 200 status code.
- Target specific user-agents. Create separate rule blocks for different bots rather than using only the global wildcard, allowing granular control over behavior like blocking GPTBot while permitting Googlebot.
- Explicitly declare sitemaps. Include the `Sitemap` directive with the full URL to help search engines discover your content structure efficiently.
- Allow critical resources. Do not block CSS, JavaScript, or image files that bots need to render pages properly for indexing.
- Test before deploying. Use validation tools to check syntax. Google open-sourced its parser, and third-party checkers such as Tame the Bots can verify rule logic.
- Avoid security through obscurity. Do not rely on robots.txt to protect sensitive data. Use password authentication or noindex directives instead, as malicious bots may use the file as a directory of valuable targets.
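One way to automate the "test before deploying" step is to check a draft file against a list of render-critical asset paths (all paths here are hypothetical). Python's standard-library parser applies rules in file order rather than by longest match, which is why the `Allow` line precedes the `Disallow` it overrides:

```python
from urllib.robotparser import RobotFileParser

# Draft rules: block a private area but keep its assets crawlable.
draft = """
User-agent: *
Allow: /private/assets/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# Render-critical resources that must remain fetchable.
critical_assets = [
    "/static/app.css",
    "/static/app.js",
    "/private/assets/logo.png",
]
blocked = [p for p in critical_assets
           if not parser.can_fetch("Googlebot", "https://www.example.com" + p)]
print(blocked)  # [] means every critical asset is crawlable
```

A check like this can run in CI so a rule change never silently blocks CSS or JavaScript.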
Common mistakes
- Mistake: Assuming blocking equals indexing prevention. Explanation: A disallowed URL can still appear in search results if linked from external sites, though without a description snippet. Fix: Use the `noindex` meta tag or X-Robots-Tag HTTP header alongside robots.txt blocks to prevent indexing.
- Mistake: Using robots.txt for security. Explanation: The file is publicly readable and may guide attackers to hidden admin panels or private directories. Fix: Implement proper authentication mechanisms such as password protection for sensitive areas.
- Mistake: Blocking render-critical resources. Explanation: Disallowing access to CSS or JavaScript files prevents search engines from rendering pages correctly, potentially hurting rankings. Fix: Ensure your Allow/Disallow patterns permit access to assets necessary for page rendering.
- Mistake: Incorrect file location. Explanation: Placing robots.txt in subdirectories or using incorrect capitalization prevents crawlers from finding it. Fix: Maintain strictly at root level (example.com/robots.txt) with UTF-8 encoding.
- Mistake: Overly broad wildcards. Explanation: Patterns like `Disallow: /` followed by `Allow: /specific/` might be interpreted differently by various crawlers. Fix: Test patterns against each target crawler's documented behavior.
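The noindex fix from the first mistake can be applied either in the page markup or, for non-HTML resources such as PDFs, as an HTTP response header:

```
<!-- Meta tag, placed in the page's <head> -->
<meta name="robots" content="noindex">

# Equivalent HTTP response header (works for non-HTML files too)
X-Robots-Tag: noindex
```

Remember that crawlers must be allowed to fetch the page to see either directive, so the URL must not also be disallowed in robots.txt.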
Examples
Example scenario: Blocking all crawlers from the entire site

```
User-agent: *
Disallow: /
```

Example scenario: Allowing all crawlers full access (equivalent to an empty or missing file)

```
User-agent: *
Disallow:
```

Example scenario: Blocking specific AI scrapers from content directories

```
User-agent: GPTBot
Disallow: /blog/
Disallow: /articles/

User-agent: ClaudeBot
Disallow: /
```

Example scenario: Slowing the Bing crawler while allowing Google full speed, and declaring a sitemap

```
User-agent: Bingbot
Crawl-delay: 10

User-agent: Googlebot
Disallow:

Sitemap: https://www.example.com/sitemap.xml
```
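The `*` and `$` operators described earlier combine into parameter- and extension-based rules. A hypothetical example (verify behavior against each target crawler, since wildcard support varies):

```
User-agent: *
# Block any URL containing a session-ID query parameter
Disallow: /*?sessionid=
# Block URLs ending in .pdf ($ anchors the match to the end of the URL)
Disallow: /*.pdf$
```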
Robots.txt vs Meta Robots Tag
| Feature | Robots.txt | Meta Robots Tag / X-Robots-Tag |
|---|---|---|
| Scope | Site-wide or directory-level | Individual page level |
| Timing | Inspected before page request | Inspected after page loads (HTTP header) or during HTML parsing (meta tag) |
| Primary use | Manage crawl traffic, block crawling | Manage indexing, block indexation |
| Security | Publicly visible file | Not visible in robots.txt |
| Effect on URL in results | URL may appear without snippet if blocked | URL typically omitted from results if noindexed |
Rule of thumb: Use robots.txt to prevent crawling of large sections or manage server load. Use meta robots tags or X-Robots-Tag headers to prevent indexing of specific pages while allowing crawlers to access them.
FAQ
Is robots.txt legally enforceable? No. The protocol operates on voluntary compliance. While major search engines respect it, malicious bots, scrapers, and some archival projects may ignore directives. The 1999 case eBay v. Bidder's Edge resulted in an injunction against a non-compliant bot, but this addressed server trespass rather than enforcing the protocol itself. For legal protection of content, use copyright notices and technical access controls rather than relying on robots.txt.
Can I block AI bots with robots.txt? Yes. Many websites now block generative AI crawlers like GPTBot, ClaudeBot, and PerplexityBot by adding their specific user-agents to Disallow rules. However, compliance varies; some AI companies have circumvented blocks by renaming scrapers or creating new user-agents. In 2025, the nonprofit RSL Collective announced the Really Simple Licensing standard as a proposed method to set licensing terms for AI bots via robots.txt extensions, though adoption remains limited.
Why are my pages still indexed if I blocked them in robots.txt?
Robots.txt prevents crawling, not indexing. If external sites link to a blocked URL, search engines may still list the URL in results without accessing the content. To remove pages from search results entirely, use a noindex directive on the page itself or password-protect the content.
What is the maximum size for a robots.txt file? RFC 9309 requires crawlers to parse at least 500 kibibytes (512,000 bytes) of the file. Google maintains this as a file size restriction. If your robots.txt exceeds this limit, crawlers may truncate it, ignoring rules appearing at the end of the file. Keep your file under this size by consolidating redundant rules and avoiding excessive commentary or whitespace.
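A quick pre-deployment guard against this truncation risk, assuming the file is read as raw bytes:

```python
# RFC 9309 requires parsers to handle at least 500 kibibytes; rules
# past that point may be silently ignored by crawlers that truncate.
MAX_ROBOTS_BYTES = 500 * 1024  # 512,000 bytes

def fits_parse_limit(robots_body: bytes) -> bool:
    """Return True if the file falls within the guaranteed parse limit."""
    return len(robots_body) <= MAX_ROBOTS_BYTES

print(fits_parse_limit(b"User-agent: *\nDisallow:\n"))  # True
```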
Do all search engines follow the same syntax rules?
No. While RFC 9309 standardizes core directives, interpretation varies. For example, the Crawl-delay directive is not part of the official standard and is handled differently by Bing (time window) and Yandex (seconds between requests), while Google ignores it entirely. Always verify behavior against specific crawler documentation.