Robots.txt is a plain text file placed in a website's root directory that implements the Robots Exclusion Protocol (REP), instructing web crawlers which portions of the site they may access. It serves as a gatekeeper for automated agents ranging from search engine spiders to AI data scrapers. For SEO practitioners, it controls crawl budget allocation and prevents server overload, and it has recently become the primary defense against unauthorized AI training data collection.
What is Robots.txt
The Robots Exclusion Protocol emerged in 1994, when Martijn Koster proposed a convention allowing site operators to specify which pages or servers web robots should avoid. The standard was formally published in September 2022 as RFC 9309, codifying the technical specification after nearly three decades of de facto use.
A site owner places the file at the root of the host (e.g., https://www.example.com/robots.txt). The file covers only one origin, meaning each subdomain, protocol (HTTP vs HTTPS), and port requires its own robots.txt file. If the file is absent, crawlers assume no restrictions exist.
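Because the file is scoped per origin, a crawler derives the robots.txt location from the scheme, host, and port of the URL it wants to fetch. A minimal sketch using Python's standard library (the page URL is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin serving page_url."""
    parts = urlsplit(page_url)
    # Scheme, host, and port together define the origin;
    # the original path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com:8443/products/item?id=7"))
# https://shop.example.com:8443/robots.txt
```

Note that `https://example.com` and `https://www.example.com` are different hosts and therefore need separate files.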
Why Robots.txt matters
- Manage crawl budget. Direct search engines toward high-value content and away from duplicate or thin pages to ensure efficient indexing of priority URLs.
- Conserve server resources. Reduce unnecessary HTTP requests from bots that might otherwise overwhelm server capacity, particularly important for large sites or during traffic spikes.
- Block AI training data collection. In 2023, research found 306 of the thousand most-visited websites blocked OpenAI's GPTBot, making robots.txt the frontline defense against unauthorized generative AI scraping.
- Prevent indexing of non-public areas. Keep staging sites, internal search results, and administrative panels out of search engine results (though this is advisory; malicious bots may still access these paths).
- Reference sitemaps. Point crawlers to XML sitemaps to ensure discovery of important content without relying solely on internal linking.
How Robots.txt works
When a bot arrives at a site, it requests the robots.txt file before crawling any other content. The bot parses the instructions to determine which URLs it may fetch.
- Identify the user-agent. The bot checks for rules targeting its specific user-agent string (e.g., "Googlebot" or "Bingbot") before falling back to the wildcard entry (`*`).
- Parse directives. The file contains directives such as `Disallow` (block access), `Allow` (override a disallow), `Crawl-delay` (suggest timing between requests), and `Sitemap` (indicate sitemap location).
- Apply pattern matching. The wildcard `*` and end-of-line anchor `$` allow pattern matching for flexible path blocking, such as blocking all URLs containing specific parameters.
- Respect scope limitations. The protocol relies on voluntary compliance. The Internet Archive announced in 2017 that it would stop complying with robots.txt directives, illustrating that archival and some AI bots may ignore restrictions.
Note that Crawl-delay interpretation varies by crawler. Yandex interprets it as seconds between visits, while Bing treats it as a time window during which the bot accesses the site only once. Google ignores this directive, offering crawl rate controls through Search Console instead.
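The user-agent lookup order described above can be demonstrated with Python's standard-library parser; the rules and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body the way a compliant crawler would.
rules = """
User-agent: GPTBot
Disallow: /blog/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot matches its own rule block; other bots fall back to the wildcard.
print(parser.can_fetch("GPTBot", "https://www.example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))  # True
print(parser.can_fetch("Googlebot", "https://www.example.com/admin/"))     # False
```

Keep in mind that `urllib.robotparser` implements the original, simpler protocol; its handling of wildcards and rule precedence differs from Google's parser.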
Best practices
- Place the file correctly. Upload robots.txt to the root directory with exact lowercase filename (robots.txt, not Robots.TXT) and ensure it returns a 200 status code.
- Target specific user-agents. Create separate rule blocks for different bots rather than using only the global wildcard, allowing granular control over behavior like blocking GPTBot while permitting Googlebot.
- Explicitly declare sitemaps. Include the `Sitemap` directive with the full URL to help search engines discover your content structure efficiently.
- Allow critical resources. Do not block CSS, JavaScript, or image files that bots need to render pages properly for indexing.
- Test before deploying. Use validation tools to check syntax. Google open-sourced its parser, and third-party checkers such as Tame the Bots can verify rule logic.
- Avoid security through obscurity. Do not rely on robots.txt to protect sensitive data. Use password authentication or noindex directives instead, as malicious bots may use the file as a directory of valuable targets.
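One way to automate the "test before deploying" step is to check a draft file against a list of render-critical asset paths (all paths here are hypothetical). Python's standard-library parser applies rules in file order rather than by longest match, which is why the `Allow` line precedes the `Disallow` it overrides:

```python
from urllib.robotparser import RobotFileParser

# Draft rules: block a private area but keep its assets crawlable.
draft = """
User-agent: *
Allow: /private/assets/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# Render-critical resources that must remain fetchable.
critical_assets = [
    "/static/app.css",
    "/static/app.js",
    "/private/assets/logo.png",
]
blocked = [p for p in critical_assets
           if not parser.can_fetch("Googlebot", "https://www.example.com" + p)]
print(blocked)  # [] means every critical asset is crawlable
```

A check like this can run in CI so a rule change never silently blocks CSS or JavaScript.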
Common mistakes
- Mistake: Assuming blocking equals indexing prevention. Explanation: A disallowed URL can still appear in search results if linked from external sites, though without a description snippet. Fix: Use the `noindex` meta tag or X-Robots-Tag HTTP header alongside robots.txt blocks to prevent indexing.
- Mistake: Using robots.txt for security. Explanation: The file is publicly readable and may guide attackers to hidden admin panels or private directories. Fix: Implement proper authentication mechanisms such as password protection for sensitive areas.
- Mistake: Blocking render-critical resources. Explanation: Disallowing access to CSS or JavaScript files prevents search engines from rendering pages correctly, potentially hurting rankings. Fix: Ensure your Allow/Disallow patterns permit access to assets necessary for page rendering.
- Mistake: Incorrect file location. Explanation: Placing robots.txt in subdirectories or using incorrect capitalization prevents crawlers from finding it. Fix: Maintain strictly at root level (example.com/robots.txt) with UTF-8 encoding.
- Mistake: Overly broad wildcards. Explanation: Patterns like `Disallow: /` followed by `Allow: /specific/` might be interpreted differently by various crawlers. Fix: Test patterns against each target crawler's documented behavior.
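The noindex fix from the first mistake can be applied either in the page markup or, for non-HTML resources such as PDFs, as an HTTP response header:

```
<!-- Meta tag, placed in the page's <head> -->
<meta name="robots" content="noindex">

# Equivalent HTTP response header (works for non-HTML files too)
X-Robots-Tag: noindex
```

Remember that crawlers must be allowed to fetch the page to see either directive, so the URL must not also be disallowed in robots.txt.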
Examples
Example scenario: Blocking all crawlers from the entire site

```
User-agent: *
Disallow: /
```

Example scenario: Allowing all crawlers full access (equivalent to an empty or missing file)

```
User-agent: *
Disallow:
```

Example scenario: Blocking specific AI scrapers from content directories

```
User-agent: GPTBot
Disallow: /blog/
Disallow: /articles/

User-agent: ClaudeBot
Disallow: /
```

Example scenario: Slowing the Bing crawler while allowing Google full speed, and declaring a sitemap

```
User-agent: Bingbot
Crawl-delay: 10

User-agent: Googlebot
Disallow:

Sitemap: https://www.example.com/sitemap.xml
```
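The `*` and `$` operators described earlier combine into parameter- and extension-based rules. A hypothetical example (verify behavior against each target crawler, since wildcard support varies):

```
User-agent: *
# Block any URL containing a session-ID query parameter
Disallow: /*?sessionid=
# Block URLs ending in .pdf ($ anchors the match to the end of the URL)
Disallow: /*.pdf$
```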
Robots.txt vs Meta Robots Tag
| Feature | Robots.txt | Meta Robots Tag / X-Robots-Tag |
|---|---|---|
| Scope | Site-wide or directory-level | Individual page level |
| Timing | Inspected before page request | Inspected after page loads (HTTP header) or during HTML parsing (meta tag) |
| Primary use | Manage crawl traffic, block crawling | Manage indexing, block indexation |
| Security | Publicly visible file | Not visible in robots.txt |
| Effect on URL in results | URL may appear without snippet if blocked | URL typically omitted from results if noindexed |
Rule of thumb: Use robots.txt to prevent crawling of large sections or manage server load. Use meta robots tags or X-Robots-Tag headers to prevent indexing of specific pages while allowing crawlers to access them.
FAQ
Is robots.txt legally enforceable? No. The protocol operates on voluntary compliance. While major search engines respect it, malicious bots, scrapers, and some archival projects may ignore directives. The 1999 case eBay v. Bidder's Edge resulted in an injunction against a non-compliant bot, but this addressed server trespass rather than enforcing the protocol itself. For legal protection of content, use copyright notices and technical access controls rather than relying on robots.txt.
Can I block AI bots with robots.txt? Yes. Many websites now block generative AI crawlers like GPTBot, ClaudeBot, and PerplexityBot by adding their specific user-agents to Disallow rules. However, compliance varies; some AI companies have circumvented blocks by renaming scrapers or creating new user-agents. In 2025, the nonprofit RSL Collective announced the Really Simple Licensing standard as a proposed method to set licensing terms for AI bots via robots.txt extensions, though adoption remains limited.
Why are my pages still indexed if I blocked them in robots.txt?
Robots.txt prevents crawling, not indexing. If external sites link to a blocked URL, search engines may still list the URL in results without accessing the content. To remove pages from search results entirely, use a noindex directive on the page itself or password-protect the content.
What is the maximum size for a robots.txt file? RFC 9309 requires crawlers to parse at least 500 kibibytes (512,000 bytes) of the file. Google maintains this as a file size restriction. If your robots.txt exceeds this limit, crawlers may truncate it, ignoring rules appearing at the end of the file. Keep your file under this size by consolidating redundant rules and avoiding excessive commentary or whitespace.
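A quick pre-deployment guard against this truncation risk, assuming the file is read as raw bytes:

```python
# RFC 9309 requires parsers to handle at least 500 kibibytes; rules
# past that point may be silently ignored by crawlers that truncate.
MAX_ROBOTS_BYTES = 500 * 1024  # 512,000 bytes

def fits_parse_limit(robots_body: bytes) -> bool:
    """Return True if the file falls within the guaranteed parse limit."""
    return len(robots_body) <= MAX_ROBOTS_BYTES

print(fits_parse_limit(b"User-agent: *\nDisallow:\n"))  # True
```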
Do all search engines follow the same syntax rules?
No. While RFC 9309 standardizes core directives, interpretation varies. For example, the Crawl-delay directive is not part of the official standard and is handled differently by Bing (time window) and Yandex (seconds between requests), while Google ignores it entirely. Always verify behavior against specific crawler documentation.