A sitemap is a file where you provide information about the pages, videos, images, and other files on your site, and the relationships between them. Search engines read this file to crawl your site more efficiently. It serves as a direct communication channel that tells search engines which pages you deem important and supplies critical metadata such as last update dates and alternate language versions.
What is a Sitemap?
The term "sitemap" refers to three distinct concepts. First, designers use sitemaps during website planning to map information architecture. Second, human-visible sitemaps present hierarchical page listings to help visitors navigate. Third, and most critical for SEO, structured XML sitemaps provide machine-readable lists of URLs specifically for web crawlers like Googlebot.
XML sitemaps follow a protocol originally introduced by Google and now jointly supported by Bing, Yahoo, and other search engines. These files typically reside at /sitemap.xml and must be UTF-8 encoded. They act as supplementary discovery mechanisms, particularly for pages that dynamic URL construction or internal search tools might otherwise hide from crawler navigation.
Why Sitemaps matter
- Ensure discovery of deep content. On large sites, it becomes difficult to guarantee every important page maintains at least one internal link. Sitemaps provide a fallback method for Googlebot to find these URLs.
- Accelerate indexing for new sites. If your site has few external backlinks, crawlers may never discover your pages through normal web traversal. A sitemap provides immediate entry points.
- Support rich media indexing. You can submit specialized sitemap extensions for video content (including runtime and ratings), images (including locations), and Google News articles (including publication dates).
- Signal content freshness. While Google ignores frequency hints (changefreq) and priority values (priority), it may use the lastmod tag if you maintain consistently accurate timestamps (Google Search Central).
- Manage scale efficiently. Sites exceeding 50,000 URLs or 50MB file sizes can use sitemap index files to organize multiple sitemap files without losing crawler attention.
How Sitemaps work
A sitemap file uses XML tags to structure URL data. The file must begin with an opening <urlset> tag containing the protocol namespace and end with </urlset>. Each URL entry requires:
- A parent <url> tag wrapping the entry
- <loc>: the canonical URL (must begin with the protocol, max 2,048 characters)
Optional tags include <lastmod> (modification date in W3C format), <changefreq> (update frequency hints), and <priority> (relative importance from 0.0 to 1.0).
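A minimal sketch of building one such entry with Python's standard library (the URL and date are placeholders, not real pages):

```python
import xml.etree.ElementTree as ET

def url_entry(loc, lastmod=None):
    """Build a <url> element with the required <loc> and optional <lastmod>."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod:
        # W3C date format, e.g. YYYY-MM-DD
        ET.SubElement(url, "lastmod").text = lastmod
    return url

# The <urlset> root must declare the protocol namespace
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
urlset.append(url_entry("https://www.example.com/page1", "2024-01-15"))
xml = ET.tostring(urlset, encoding="unicode")
print(xml)
```

Using an XML library rather than string concatenation also gives you entity escaping for free.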
Technical constraints:
- Maximum 50,000 URLs per file
- Maximum 50MB (52,428,800 bytes) uncompressed file size
- Must reside on the same host and protocol as the URLs listed
- Special characters require entity escaping (e.g., & becomes &amp;)
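Escaping is easy to get wrong by hand; a sketch using the standard library instead (the query URL is hypothetical):

```python
from xml.sax.saxutils import escape

# A raw URL containing an ampersand, which is illegal in XML text
raw = "https://www.example.com/search?q=shoes&color=red"
safe = escape(raw)  # escapes &, <, and > by default
print(safe)
```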
For larger sites, create a sitemap index file that lists individual sitemap files. This index follows similar constraints (max 50,000 sitemaps, 50MB) and uses <sitemapindex> parent tags with <sitemap> and <loc> child entries.
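A sketch of generating such an index file, assuming the child sitemap URLs are already known:

```python
import xml.etree.ElementTree as ET

def sitemap_index(sitemap_urls):
    """Build a <sitemapindex> document listing child sitemap files."""
    index = ET.Element("sitemapindex",
                       xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(index, encoding="unicode"))

# Hypothetical child sitemaps for illustration
index_xml = sitemap_index([
    "https://www.example.com/sitemap1.xml",
    "https://www.example.com/sitemap2.xml",
])
```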
Types of Sitemaps
| Type | Purpose | When to use |
|---|---|---|
| XML Sitemap | Standard crawler feed | All sites needing search engine discovery |
| Image Sitemap | Image-specific metadata | When you want images indexed separately from pages |
| Video Sitemap | Video runtime, rating, age-appropriateness | Sites with significant video content |
| News Sitemap | Article titles and publication dates | Publishers appearing in Google News |
| HTML Sitemap | Human navigation aid | Large sites where users need page overviews |
| Text File | Simple URL lists (one per line) | Legacy systems or simple URL dumps |
Alternative formats include RSS 2.0 or Atom feeds, though these typically only expose recent URLs rather than complete site structures.
Best practices
Place your sitemap at the root directory. Locating your sitemap at http://example.com/sitemap.xml allows it to reference any URL starting with that domain. Sitemaps in subdirectories cannot list URLs from parent directories.
Reference your sitemap in robots.txt. Add the line Sitemap: http://www.example.com/sitemap.xml to your robots file. This works independently of user-agent lines and provides implicit proof of ownership for cross-submits.
Maintain accurate lastmod dates. Update this value only when the linked page actually changes, not when you regenerate the sitemap. Consistent accuracy increases the likelihood search engines will use this data.
Use sitemap indexes for scale. When you exceed 50,000 URLs or the 50MB limit, split content into multiple sitemaps and list them in an index file rather than truncating content.
Compress with gzip. You may gzip sitemap files to reduce bandwidth, provided the uncompressed file remains under 50MB.
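A sketch of writing a gzipped sitemap while enforcing the uncompressed limit (the output path is a placeholder):

```python
import gzip
import os
import tempfile

MAX_UNCOMPRESSED = 50 * 1024 * 1024  # 52,428,800 bytes, the protocol limit

def write_gzipped_sitemap(xml_text, path):
    """Gzip a sitemap after checking the uncompressed size limit."""
    data = xml_text.encode("utf-8")
    if len(data) > MAX_UNCOMPRESSED:
        raise ValueError("uncompressed sitemap exceeds 50MB; split it instead")
    with gzip.open(path, "wb") as f:
        f.write(data)

# Hypothetical output location for illustration
out_path = os.path.join(tempfile.gettempdir(), "sitemap.xml.gz")
write_gzipped_sitemap('<?xml version="1.0" encoding="UTF-8"?><urlset/>', out_path)
```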
Validate XML structure. Check your sitemap against the official XSD schema to catch parsing errors that could block indexing.
Exclude non-indexable URLs. Do not include pages blocked by robots.txt or meta noindex tags in your sitemap, as this creates conflicting signals.
Common mistakes
Including non-canonical URLs. Your sitemap should only list the preferred version of each page. Including parameterized URLs or http/https duplicates wastes crawl budget.
Exceeding file limits. If your sitemap contains more than 50,000 URLs or exceeds 50MB uncompressed, search engines may reject it entirely. Fix by splitting into multiple files and using a sitemap index.
Inaccurate lastmod timestamps. Setting all dates to the current date invalidates the signal. Fix by matching the tag to the actual file modification date.
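One way to keep lastmod honest is to derive it from the source file's actual modification time rather than the generation time; a minimal sketch:

```python
import os
from datetime import datetime, timezone

def lastmod_for(path):
    """Return the file's mtime as a W3C datetime string in UTC."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%S+00:00")
```

For database-backed pages, the equivalent would be the row's updated-at column rather than the file mtime.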
Forgetting entity escaping. URLs containing ampersands, quotes, or angle brackets break XML parsers unless converted to escape codes like &amp;.
Submitting via unauthorized hosts. If you host sitemaps for multiple domains on a single server, you must verify ownership through robots.txt references on each target domain or through cross-submit protocols.
Relying on priority and changefreq. Google ignores these values when determining crawl frequency or ranking. Do not spend resources optimizing them. (Google Search Central).
Examples
Basic XML Sitemap structure:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/page1</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/page2</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```
Sitemap Index for large sites:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2024-01-15T18:00:15+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>
```
Example scenario: An e-commerce site with 60,000 product pages splits its catalog into two XML files (sitemap-products-1.xml and sitemap-products-2.xml), each containing 30,000 URLs. Both files are listed in sitemap-index.xml, which is referenced in robots.txt and submitted through Google Search Console.
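The split in the scenario above can be sketched as a simple chunking step (the URL list and the 50,000-per-file cap follow the protocol limit; the product URLs are hypothetical):

```python
def chunk_urls(urls, max_per_file=50_000):
    """Split a flat URL list into sitemap-sized chunks."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]

# 60,000 hypothetical product URLs
urls = [f"https://www.example.com/product/{i}" for i in range(60_000)]
chunks = chunk_urls(urls)
# At the 50,000-per-file limit this yields two files; each chunk would then
# be written out as its own sitemap and listed in the index file.
```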
FAQ
Do I need a sitemap if my site is small? Not necessarily. If your site has approximately 500 pages or fewer and maintains comprehensive internal linking so that every important page is reachable from the homepage, Google can usually discover your content without a sitemap. (Google Search Central).
Does submitting a sitemap guarantee indexing? No. Sitemaps help crawlers discover URLs, but being crawled does not guarantee that search engines will index the content or display it in results.
How do I submit my sitemap to search engines?
You have three methods: submit directly through search engine webmaster interfaces (like Google Search Console), reference the URL in your robots.txt file using Sitemap: [URL], or send an HTTP ping request to the search engine's specific endpoint (note that Google deprecated its sitemap ping endpoint in 2023, so prefer the first two methods for Google).
What is the difference between XML and HTML sitemaps? XML sitemaps are machine-readable files designed for search engine crawlers. HTML sitemaps are human-visible web pages that help site visitors navigate your content hierarchy. They serve different audiences, and large sites often maintain both.
Should I include every page on my site? Only include pages you want indexed in search results. Exclude utility pages, duplicate content, pages blocked by robots.txt, and pages with meta noindex tags.
How often should I update my sitemap? Update whenever you add, remove, or significantly modify content. For static sites, monthly updates suffice. For news or e-commerce sites, update daily or use automated generation.
Can I use text files instead of XML? Yes. Search engines accept simple text files containing one URL per line, following the same limits (50,000 URLs, 50MB, UTF-8 encoding). However, text files cannot include metadata like lastmod dates.