Googlebot is the generic name for Google's web crawlers that discover and retrieve web pages to build Google Search's index. The term covers two primary variants: Googlebot Smartphone and Googlebot Desktop. If Googlebot cannot access your content, your pages cannot appear in Google Search results, period.
What is Googlebot?
Googlebot is not a single crawler but a collective name for two distinct crawlers used by Google Search: Googlebot Smartphone (which simulates a mobile user) and Googlebot Desktop (which simulates a desktop user). You can identify the subtype by examining the HTTP User-Agent request header, but both types obey the same product token in robots.txt, meaning you cannot selectively target one or the other with robots.txt rules.
Since May 2019, Googlebot uses the latest Chromium rendering engine, making it "evergreen" and capable of processing modern JavaScript and CSS as users see them. Googlebot discovers URLs by harvesting links from previously crawled pages, sitemaps, RSS feeds, and URLs submitted via Google Search Console or the Indexing API.
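Since the two crawler types differ only in their User-Agent strings, telling them apart in server logs is a simple substring check. A minimal sketch (the example strings follow the patterns Google publishes, but exact browser version numbers vary over time):

```python
# Substring hints for telling the two search crawlers apart; illustrative,
# since Google may update its user-agent strings over time.
MOBILE_HINTS = ("Android", "Mobile Safari")

def classify_googlebot(user_agent: str):
    """Return 'smartphone', 'desktop', or None for a User-Agent header.

    This inspects the header only; a spoofed string passes, so pair it
    with reverse DNS verification before trusting the result.
    """
    if "Googlebot" not in user_agent:
        return None
    if any(hint in user_agent for hint in MOBILE_HINTS):
        return "smartphone"
    return "desktop"

ua_mobile = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 "
             "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
             "+http://www.google.com/bot.html)")
ua_desktop = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
              "compatible; Googlebot/2.1; +http://www.google.com/bot.html) "
              "Chrome/120.0.0.0 Safari/537.36")
```

Note that robots.txt cannot make this distinction: both strings map to the same `Googlebot` product token.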
Why Googlebot matters
- Search visibility gatekeeper: No crawl means no index means no organic traffic.
- Mobile-first dominance: Since September 2020, all sites use mobile-first indexing. Consequently, the majority of Googlebot crawl requests come from the mobile crawler, and Google uses the mobile version of content for indexing.
- Volume and speed: Googlebot is the most active crawler on the web, accounting for 23.7% of HTTP requests from good bots (compared with AhrefsBot at 14.27% and Bingbot at 4.57%). Slow server response times directly limit how comprehensively Google can crawl your site.
- JavaScript execution: Through its Web Rendering Service (WRS), Googlebot renders pages using cached CSS and JavaScript resources, ensuring dynamic content enters the index.
- Diagnostic insight: The Crawl Stats Report in Google Search Console reveals how Googlebot interacts with your server, exposing bottlenecks before they tank rankings.
How Googlebot works
The crawling process follows a continuous pipeline:
- URL discovery: Googlebot starts with a seed list of URLs from previous crawls, sitemap submissions, RSS feeds, and discovered links. New pages must be linked from known pages or manually submitted to be found.
- Prioritization and fetching: Googlebot prioritizes URLs and fetches them, respecting robots.txt directives. It crawls the first 2MB of supported file types and the first 64MB of PDF files; content beyond these limits is ignored.
- Resource fetching: Each resource referenced in HTML (CSS, JavaScript, API requests) is fetched separately for rendering purposes.
- Rendering: The Web Rendering Service processes cached resources to view pages as a user would. Googlebot uses the mobile version of the rendered page for indexing.
- Indexing and link extraction: Processed content enters the index, and newly discovered links return to the discovery queue.
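The loop above can be sketched in miniature. Here `fetch`, `extract_links`, and `is_allowed` are stand-ins for real fetching, rendering/link extraction, and a robots.txt check; prioritization and the rendering step are omitted for brevity:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, is_allowed, max_pages=100):
    """Toy model of the discovery -> fetch -> index -> link-extraction loop."""
    queue = deque(seed_urls)   # URL discovery queue
    seen = set(seed_urls)
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):            # respect robots.txt before fetching
            continue
        body = fetch(url)                  # real Googlebot also truncates large files
        index[url] = body                  # "indexing" step
        for link in extract_links(url, body):
            absolute = urljoin(url, link)
            if absolute not in seen:       # newly discovered links rejoin the queue
                seen.add(absolute)
                queue.append(absolute)
    return index

# Toy three-page site; all URLs and callbacks are hypothetical.
site = {
    "https://example.com/": ("home", ["/a", "/blocked"]),
    "https://example.com/a": ("page a", []),
    "https://example.com/blocked": ("secret", []),
}
result = crawl(
    ["https://example.com/"],
    fetch=lambda u: site[u][0],
    extract_links=lambda u, b: site[u][1],
    is_allowed=lambda u: "/blocked" not in u,
)
```

The blocked URL is discovered but never fetched or indexed, mirroring how a robots.txt disallow stops crawling without stopping discovery.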
Technical protocols: Googlebot supports both HTTP/1.1 and HTTP/2. Using HTTP/2 provides no ranking boost in Google Search, though it may reduce CPU and RAM usage on your server. Googlebot accepts gzip, deflate, and Brotli compression. For conditional requests, it supports the ETag response header paired with If-None-Match (preferred over Last-Modified/If-Modified-Since, which is prone to date-formatting issues).
Types of Googlebot
While "Googlebot" generally refers to the search crawlers, Google operates several specialized variants:
| Type | Purpose | User-Agent Identifier |
|---|---|---|
| Googlebot Smartphone | Simulates mobile devices for mobile-first indexing | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X...) |
| Googlebot Desktop | Simulates desktop users | Mozilla/5.0 AppleWebKit/537.36... |
| Googlebot Image/Video | Specialized for media indexing | Googlebot-Image/1.0 or Googlebot-Video/1.0 |
| InspectionTool | Used by Search Console Rich Results Test and URL Inspection tools | Google-InspectionTool/1.0 |
| Storebot | Crawls for Google Shopping | Storebot-Google/1.0 |
| GoogleOther | Crawler for other Google products | GoogleOther |
Both smartphone and desktop crawlers identify themselves distinctly but respect the same robots.txt rules.
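Because the two search crawlers share one product token while the media crawlers have their own, robots.txt rules can only distinguish crawlers at the token level. A minimal illustration (the paths are hypothetical):

```
# Applies to BOTH Googlebot Smartphone and Googlebot Desktop --
# the shared "Googlebot" product token cannot separate them
User-agent: Googlebot
Disallow: /internal/

# The image crawler has its own product token and can be targeted alone
User-agent: Googlebot-Image
Disallow: /private-images/
```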
Best practices
Verify crawler identity before acting. Malicious bots spoof the "Googlebot" user-agent string. Use reverse DNS lookup or match against Google's published IP ranges to confirm requests are genuine.
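A sketch of forward-confirmed reverse DNS in Python. The lookup functions are injectable so the logic can be exercised without network access; by default they use the system resolver:

```python
import socket

def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    Step 1: the reverse lookup must resolve to a googlebot.com or
            google.com hostname.
    Step 2: the forward lookup of that hostname must return the
            original IP (otherwise the PTR record could be forged).
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

In production you would call this with the defaults (and cache results, since DNS lookups on every request are expensive); the fakes below are only for offline testing.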
Monitor response times in Crawl Stats. Google recommends maintaining an average response time around 100ms; times nearing 1,000ms may limit Googlebot's ability to crawl your site comprehensively. Export data from Settings > Crawl Stats in GSC to track 90-day trends.
Implement ETag caching. Configure your server to return ETag headers instead of relying solely on Last-Modified. This reduces bandwidth during re-crawls and avoids date-parsing errors that can cause unnecessary refetches.
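A minimal sketch of the server-side logic, assuming a strong ETag derived from the response body (framework plumbing omitted):

```python
import hashlib

def respond(body: bytes, if_none_match):
    """Return (status, headers, body) for an ETag-conditional request.

    When the client's If-None-Match matches the current ETag, reply
    304 Not Modified with an empty body so Googlebot reuses its cached
    copy instead of re-downloading the page.
    """
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag}
    if if_none_match == etag:
        return 304, headers, b""   # unchanged since last crawl
    return 200, headers, body      # changed (or first fetch): send full body

# First fetch returns 200 plus an ETag; the recrawl echoes it back.
status1, headers1, _ = respond(b"<html>hello</html>", None)
status2, _, body2 = respond(b"<html>hello</html>", headers1["ETag"])
```

Hashing the full body on every request is the simplest correct approach; high-traffic servers typically precompute the ETag when content changes instead.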
Distinguish crawl blocking from index blocking. Use robots.txt to manage server load by preventing crawling of non-essential files. Use the noindex meta tag to prevent content from appearing in search results. Remember that robots.txt does not prevent indexing if other sites link to the URL.
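To make the distinction concrete, a sketch of the index-blocking directives (the X-Robots-Tag response header is the HTTP-level equivalent of the meta tag, useful for non-HTML files):

```
<!-- In the page's HTML head: the page may be crawled but will not be
     indexed. The URL must stay crawlable in robots.txt, or Googlebot
     never sees this directive. -->
<meta name="robots" content="noindex">

# HTTP response header alternative, e.g. for PDFs:
X-Robots-Tag: noindex
```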
Optimize for mobile-first indexing. Since mobile Googlebot makes the majority of requests, ensure your mobile site contains all critical content and structured data. Check the "Googlebot type" breakdown in Crawl Stats to verify mobile crawl behavior.
Common mistakes
Mistake: Blocking a URL in robots.txt and expecting it to disappear from Google Search. Fix: This only stops crawling. The URL can still appear in results if linked externally. Use noindex or remove the content entirely to eliminate search visibility.
Mistake: Trusting user-agent strings without verification. Fix: Spoofed bots trigger false alarms in server logs. Always verify via IP authentication or reverse DNS before whitelisting or blocking.
Mistake: Ignoring redirect chains from legacy campaigns. Fix: During migrations, outdated ad campaign URLs generating 301 redirects can spike response times. Update campaign links directly to new URLs rather than relying on redirects, reducing unnecessary server load that limits crawl comprehensiveness.
Mistake: Assuming HTTP/2 improves rankings. Fix: HTTP/2 saves computing resources but offers no ranking boost in Google Search. Do not migrate protocols solely for SEO gains.
Mistake: Blocking CSS and JavaScript files. Fix: Googlebot needs these resources to render pages correctly. Blocking them can cause indexing issues or "content mismatch" errors.
Examples
Migration monitoring scenario: A site migrating 20,000 URLs sees average response times spike to 4,299ms. Analysis of the Crawl Stats Report reveals the culprit is a surge in 301 redirects triggered by old ad campaign URLs still receiving traffic. By updating the campaign links to point directly to new URLs, the team reduces server load and restores response times to the recommended 100ms range, allowing migration to complete without indexing delays.
JavaScript rendering scenario: A single-page application relies heavily on client-side JavaScript to load product listings. Googlebot's Web Rendering Service processes the JavaScript, but the site ensures critical metadata and initial content reside within the first 2MB of the HTML to guarantee indexing even if rendering resources are delayed.
Verification scenario: A server log shows aggressive crawling from an IP claiming to be Googlebot. Using Google's published IP ranges, the webmaster discovers the IP does not match Googlebot's known addresses. They block the imposter, preventing a DDoS attack while ensuring legitimate Googlebot access remains open.
FAQ
What is the difference between Googlebot Smartphone and Desktop? Googlebot Smartphone simulates a mobile user agent and viewport, while Googlebot Desktop simulates a desktop environment. Since September 2020, all sites use mobile-first indexing, meaning Googlebot Smartphone performs the majority of crawl requests and Google indexes the mobile version of your content.
How can I verify that real Googlebot is visiting my site? Do not trust the user-agent string alone, as spoofing is common. Verify by performing a reverse DNS lookup on the requesting IP to confirm it resolves to a googlebot.com or google.com hostname. Alternatively, match the IP against Google's published IP ranges for crawlers.
Does blocking Googlebot remove my pages from Google Search? No. Blocking Googlebot via robots.txt prevents crawling but does not prevent indexing. If external sites link to the page, the URL may still appear in search results without a description. To remove content from search results, use a noindex meta tag or password-protect the page.
What file size limits does Googlebot have? Googlebot crawls the first 2MB of supported file types and the first 64MB of PDF files. Content beyond these limits is not indexed. Ensure critical content appears within these boundaries.
How do I slow down Googlebot if it is overwhelming my server? While the legacy crawl rate setting in Search Console has been deprecated, you can reduce crawl rate by returning appropriate HTTP status codes (such as 503 or 429) during peak times, or by using the crawl rate reduction request form for temporary relief.
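A hedged sketch of the status-code approach; the load metric and threshold here are illustrative, not prescriptive:

```python
def handle_request(current_load: float, threshold: float = 0.9):
    """Shed crawler load during peaks: 503 (or 429) tells Googlebot to
    back off and retry later.

    Caution: avoid serving 503 for extended periods, since persistently
    unavailable URLs may eventually be dropped from the index.
    """
    if current_load >= threshold:
        # Retry-After hints when the crawler should come back (seconds).
        return 503, {"Retry-After": "3600"}, b""
    return 200, {}, b"<html>page content</html>"
```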
Does Googlebot support HTTP/2? Yes. Googlebot supports both HTTP/1.1 and HTTP/2 and automatically selects the protocol offering better crawling performance. However, using HTTP/2 provides no ranking benefit in Google Search, though it may reduce server resource consumption.
What is mobile-first indexing? Mobile-first indexing means Google predominantly uses the mobile version of your site's content for indexing and ranking. As of September 2020, this applies to all websites. Consequently, Googlebot Smartphone makes the majority of crawl requests to your site.