A crawler does not crawl your site without limit: it caps how many pages it will fetch per unit of time. That cap is the crawl budget. If the budget is spent on junk URLs (duplicates, infinite filters, redirect chains), important pages get indexed slowly or not at all. For a small site this is rarely an issue, but for an e-commerce store with tens of thousands of URLs the crawl budget directly determines how fast new products and changes reach Google and Yandex results. Let us break down how it works, what eats it, and how to clean it up.
What crawl budget actually is
Google describes crawl budget through two components. The first is the crawl rate limit (Google's current documentation calls it the crawl capacity limit): how many simultaneous connections the crawler opens and what pause it keeps between requests so as not to overload your server. If the server responds fast and error-free, the limit rises; if it is slow or returns 5xx, it drops. The second is crawl demand: how interested the engine is in re-crawling your pages at all. Popular and frequently updated URLs are crawled more often, forgotten and static ones less. Taken together, these two factors define the crawl budget: the set of URLs the crawler can and wants to fetch.
Does crawl budget matter for your site
The honest answer: most small sites (up to a few thousand pages) need not worry about crawl budget, because the crawler gets to everything. It becomes a real problem when at least one condition holds: the site has tens or hundreds of thousands of URLs; there are auto-generated pages (filters, sorts, on-site search, calendars); content changes often and reindexing speed matters; you have noticed in Search Console that some pages sit for months in "Discovered - currently not indexed". If that is you, optimizing the budget yields a tangible lift in index coverage.
What wastes crawl budget
Budget leaks onto URLs the crawler does not need but fetches anyway because it found links to them or they are in the sitemap. Here are the biggest budget eaters, in order of how often they bite:
- Duplicate pages: the same content under multiple URLs (with/without trailing slash, with UTM tags, http/https, www/non-www, index.html). The crawler fetches each variant separately.
- Faceted navigation and filters: combinations of "color × size × price × sort" spawn millions of near-identical URLs. The classic e-commerce trap.
- Infinite spaces: calendars with an endless "next" link, pagination to nowhere, on-site search pages with arbitrary parameters.
- Redirect chains and loops: the crawler fetches every extra hop separately and may never reach the target (see our redirects guide).
- Soft 404s: pages that return a 200 status but are effectively empty or say "nothing found". The crawler spends budget on them for zero value.
- Slow server and 5xx errors: these lower the crawl rate limit, because the crawler sees the server struggling and visits less often.
- Low-quality and thin pages: tags with a single entry, empty cards, auto-generated text. They drag down the overall crawl demand for the site.
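To gauge how many crawlable variants collapse into one page, you can normalize URLs and group them. The sketch below is a minimal illustration, not a definitive rule set. It assumes that the scheme, the `www.` prefix, a trailing slash, `index.html` and the parameters in `TRACKING_PARAMS` never change content on your site. The names `normalize_url` and `group_duplicates` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change page content on this site.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalize_url(url: str) -> str:
    """Map duplicate variants (scheme, www, slash, index.html, UTM) to one key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    if len(path) > 1:
        path = path.rstrip("/")
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    return urlunsplit(("https", host, path, urlencode(query), ""))


def group_duplicates(urls):
    """Return only the normalized keys that more than one crawled URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Feeding it the URLs from a crawl export or from server logs gives a rough list of duplicate groups that are candidates for a redirect or a canonical tag.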
How to find crawl problems
Before optimizing anything, you need to see where the crawler actually spends the budget. There are three data sources. The first is the Crawl Stats report in Google Search Console: it shows total crawl requests, total download size, average response time, and a breakdown by response status and file type. A sharp rise in requests to one URL type is a leak signal. The second is Yandex Webmaster: the "Crawl statistics" and "Pages in search" sections show what Yandex crawled, what it excluded, and why. The third and most precise is server log analysis: logs show every bot hit (identified by the Googlebot or YandexBot user agent), and you can compute what share of crawling goes to junk URLs.
Log analysis is the most underused and most honest technical-SEO tool. Export access logs for a couple of weeks, filter by bots, group by URL pattern, and look at the most-crawled patterns. On large sites it is common to find that 30 to 60% of bot requests go to parameter URLs, filter pages or redirects: budget burned for nothing. That is your fix list.
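The log workflow above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the logs use the "combined" format common to nginx and Apache, bots are recognized by a substring of the user agent, and the URL categories in `bucket` (including the `/search` prefix) are hypothetical and need adapting to your URL scheme.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: "combined" access log format (request, status, size, referer, agent).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_MARKERS = ("Googlebot", "YandexBot")


def bucket(url: str, status: str) -> str:
    """Assign a bot request to a coarse, site-specific URL category."""
    if status.startswith("3"):
        return "redirect"
    parts = urlsplit(url)
    if parts.path.startswith("/search"):
        return "on-site search"
    if parts.query:
        return "parameter URL"
    return "clean URL"


def crawl_breakdown(lines):
    """Count bot requests per category; non-bot and unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and any(bot in match["agent"] for bot in BOT_MARKERS):
            counts[bucket(match["url"], match["status"])] += 1
    return counts
```

Dividing each count by the total gives the share of crawling per category. Because a user agent string can be spoofed, a stricter audit would also verify bot IP addresses, which this sketch does not attempt.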
How to optimize crawl budget
Optimization means redirecting the budget away from junk and onto valuable pages. In order of impact:
- Block useless URLs in robots.txt: sort parameters, on-site search, cart, service pages. The crawler will skip them and save budget (remember: robots.txt blocks crawling, not indexing; for already-indexed pages use noindex).
- Set canonical on duplicates and parameter URLs: point to the canonical version so equity and crawling concentrate there.
- Remove redirect chains: each 301 should point straight to the final URL, with no intermediate hops.
- Keep a clean sitemap.xml: only canonical, indexable URLs with accurate lastmod dates. Crawlers treat the sitemap as a hint about what to prioritize, and the hint is only trusted while it stays accurate.
- Speed up the server: lower TTFB, enable caching and compression. A fast response directly raises the crawl rate limit.
- Eliminate soft 404s: empty pages should honestly return 404 or 410, not 200.
- Strengthen internal linking to important pages: the more internal links point to a page, the higher its crawl priority.
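Collapsing redirect chains is easy to automate once the redirect rules are exported as a source-to-target mapping. The sketch below assumes such a mapping of paths; the function name `collapse_redirects` is hypothetical. It resolves every source to its final destination and reports sources that end up in a loop.

```python
def collapse_redirects(redirects: dict[str, str]) -> tuple[dict[str, str], set[str]]:
    """Return (source -> final target, sources whose chain ends in a loop)."""
    flattened: dict[str, str] = {}
    looping: set[str] = set()
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow hops until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                looping.add(source)
                break
            seen.add(target)
            target = redirects[target]
        else:
            # Reached only when the chain ended without hitting a loop.
            flattened[source] = target
    return flattened, looping


rules = {"/a": "/b", "/b": "/c", "/x": "/y", "/y": "/x"}
flattened, looping = collapse_redirects(rules)
# flattened maps both "/a" and "/b" straight to "/c"; "/x" and "/y" are reported as a loop.
```

The flattened mapping can then replace the original rules, so each old URL answers with a single 301.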
Yandex specifics
Yandex crawls by its own logic and gives a bit more direct control than Google. Yandex Webmaster has a "Recrawl pages" tool: you can manually submit up to a few dozen URLs per day for priority crawling, which is useful for fresh important pages. The historical Crawl-delay directive in robots.txt (a pause between requests) was once honored by Yandex, but it is now recommended to control crawl speed via Webmaster settings rather than robots.txt. Yandex also removes blocked pages from the index more slowly than Google, so be patient after changes: recrawling and reindexing take longer there.
Crawl optimization checklist
- Checked Crawl Stats in GSC and crawl statistics in Webmaster, so you know where the budget goes.
- Parameter and service URLs blocked in robots.txt, duplicates given a canonical.
- Redirect chains collapsed to a single hop, no loops.
- Sitemap contains only canonical, indexable URLs with accurate lastmod.
- TTFB is low, no 5xx errors, soft 404s eliminated.
- Important pages are well interlinked and no deeper than 3 clicks from the homepage.
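The click-depth item in the checklist can be verified with a breadth-first search over the internal link graph. The sketch below assumes you already have that graph as a mapping from each page to the pages it links to (for example, from a crawler export). The names `click_depths` and `too_deep` are hypothetical.

```python
from collections import deque


def click_depths(links: dict[str, list[str]], home: str = "/") -> dict[str, int]:
    """Breadth-first search: minimum number of clicks from the homepage to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, important, max_depth: int = 3, home: str = "/"):
    """Important pages deeper than max_depth, including ones not reachable at all."""
    depths = click_depths(links, home)
    return sorted(p for p in important if depths.get(p, max_depth + 1) > max_depth)
```

Pages the search never reaches are reported as well, since an orphan page is an even stronger crawl signal than a deep one.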
Crawl budget is not magic, it is hygiene: you help the crawler avoid wasting effort on junk and focus on what drives traffic. Site Metrics Tool helps keep it under control: it monitors indexing, Core Web Vitals, TTFB, redirect chains and the technical state of the site in one dashboard across Google and Yandex, and alerts you when something breaks.
Frequently asked
Do I even need to think about crawl budget?
If the site has fewer than a few thousand pages and no auto-generated URLs, almost certainly not: the crawler will get to everything. Crawl budget matters at tens of thousands of URLs, with faceted navigation, or when pages go unindexed for months.
Does robots.txt help save budget?
Yes. URLs blocked in robots.txt are not crawled, and the budget is redirected to useful pages. But remember: robots.txt blocks crawling, not indexing. If a page is already indexed, remove it via meta noindex (and do NOT block it in robots.txt, or the crawler will never see the noindex).
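Before deploying new rules, it helps to check which URLs they would actually block. The sketch below uses Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` and the helper name `blocked_urls` are illustrative assumptions. The standard-library parser does plain prefix matching and does not implement the `*` and `$` wildcards, so it is only suitable for simple prefix rules like these.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block on-site search and the cart for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""


def blocked_urls(robots_txt: str, urls, agent: str = "Googlebot"):
    """Return the URLs that the given robots.txt text forbids the agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


blocked = blocked_urls(ROBOTS_TXT, ["/search?q=shoes", "/cart", "/shoes"])
# blocked contains the search and cart URLs; "/shoes" stays crawlable.
```

Running a sample of important URLs through the same check is a cheap guard against accidentally blocking pages that should be crawled.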
How can I speed up indexing of a new page?
Add the page to the sitemap, link to it internally from pages that are already crawled, and submit the URL for recrawl in Yandex Webmaster and via URL Inspection in GSC. The IndexNow protocol (supported by Yandex and Bing) notifies participating search engines about new URLs almost instantly.
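An IndexNow submission is a single JSON POST. The sketch below only builds the request and leaves sending to the caller. It follows the public IndexNow documentation as commonly described (fields `host`, `key`, `keyLocation`, `urlList`, shared endpoint `api.indexnow.org`), so verify it against the current specification before relying on it. The helper name `build_indexnow_request` and the key value are hypothetical, and the key file is assumed to be hosted at the site root.

```python
import json
import urllib.request

# Assumption: the shared IndexNow endpoint; search engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST announcing new or changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served from the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request("example.com", "abc123", ["https://example.com/new-product"])
# To actually submit: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes the payload easy to inspect or log before anything leaves your server.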
Does site speed affect crawling?
Directly. The faster the server responds (low TTFB) and the fewer 5xx errors it returns, the higher the crawl rate limit: the crawler opens more connections and visits more often. A slow server throttles its own indexing.
What is a soft 404 and why is it harmful?
A soft 404 is a page that is effectively empty or says "nothing found" but returns HTTP 200 instead of 404. The crawler treats it as a working page, spends budget on it, and may index it as thin content. Empty pages should honestly return 404 or 410.
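Soft 404 candidates can be flagged automatically from a crawl that stores each page's status code and HTML. The heuristic below is a rough sketch rather than how any search engine detects them: the phrases in `EMPTY_MARKERS` and the `MIN_TEXT_CHARS` threshold are assumptions to tune per site, and every flagged page should be reviewed manually.

```python
import re

# Assumptions: illustrative phrases and length threshold, tune them for your site.
EMPTY_MARKERS = ("nothing found", "no results", "page not found", "0 items")
MIN_TEXT_CHARS = 200


def visible_text(html: str) -> str:
    """Very rough tag stripper; a real audit would use a proper HTML parser."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def looks_like_soft_404(status: int, html: str) -> bool:
    """Flag pages that answer 200 but carry almost no text or an 'empty' phrase."""
    if status != 200:
        return False
    text = visible_text(html).lower()
    return len(text) < MIN_TEXT_CHARS or any(marker in text for marker in EMPTY_MARKERS)
```

Pages it flags are candidates for a real 404 or 410 response, or for content that justifies the 200.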