robots.txt: 8 common mistakes and how to catch them for free

robots.txt looks deceptively simple: five lines, three or four directives. Because of that simplicity it gets treated carelessly: edited once a year, never tested, left to the CMS. That is a mistake. It is one of the riskiest files on a site: one wrong line can start pushing the whole domain out of Google and Yandex within a week. This article walks through eight concrete mistakes we routinely catch on production sites via our free robots.txt validator, and how to find each in a minute.

1. Disallow: / left over from staging

A classic. On staging you set Disallow: / so that Google would not index the test domain by accident. A month later the same file gets copied to production (via git, a manual transfer, or a deploy bug). The site runs for a month and Google sees nothing. Symptom: traffic flattens, then drops to zero within a week. GSC reports pages as "Blocked by robots.txt" or "Indexed, though blocked by robots.txt". Check: open /tools/robots-txt-validator and point it at domain.com/robots.txt. If the group for User-agent: * contains a bare Disallow: /, that is your alarm. Fix: remove Disallow: / and list only the specific utility paths you want to block.
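The check above can also be scripted for a deploy pipeline. The following is a minimal sketch, not the validator's implementation: the function name blanket_disallow_agents is hypothetical, and the parsing is deliberately simplified (it only looks for a bare Disallow: / and does not weigh any Allow rules in the same group).

```python
def blanket_disallow_agents(robots_txt: str) -> list[str]:
    """Return the user-agents whose group contains a bare `Disallow: /`."""
    flagged: list[str] = []
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:     # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in agents:
                    if agent not in flagged:
                        flagged.append(agent)
    return flagged


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"

print(blanket_disallow_agents(staging))     # ['*']
print(blanket_disallow_agents(production))  # []
```

A non-empty result is a reason to look at the file, not proof of a problem: a group that pairs Disallow: / with explicit Allow rules can be intentional.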

2. Case-sensitive paths

Paths in robots.txt are case-sensitive. If you write Disallow: /Admin, it won't block /admin, /ADMIN, or /aDmIn. Crawlers treat upper and lower case in the path as different URLs. Check: take the URL you want to test and run both casings through the validator. Fix: duplicate the directive for every variant that actually exists, or normalise URLs on the server to a single case with a 301 redirect to lowercase.
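To see the case sensitivity in practice, here is a small sketch using Python's standard-library urllib.robotparser. It is a generic parser rather than any search engine's own, so treat it as an illustration of the rule, not a guarantee of how a specific crawler behaves.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /Admin
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Only the exact casing written in the rule is blocked.
for path in ("/Admin", "/admin", "/ADMIN"):
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(path, verdict)
# /Admin blocked
# /admin allowed
# /ADMIN allowed
```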

3. Allow and Disallow for the same path

Conflicting rules are a frequent headache. For example: Allow: /products/ + Disallow: /products. Which wins? For a URL such as /products/shoes, Google applies Allow, because the longest matching rule takes precedence and /products/ is one character longer than /products. Yandex reaches the same result with slightly different logic. But /products itself and /products-archive match only the Disallow rule and stay blocked. The problem: cross-engine behaviour may diverge, making diagnosis harder. Our validator shows both outcomes (what Googlebot sees vs YandexBot) in one table. Fix: write rules unambiguously and don't lean on "longer-match priority".
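The longest-match precedence can be sketched in a few lines. This is a simplified model based on RFC 9309 (longest matching path wins, Allow wins a tie, no match means allowed); the function name is_allowed is hypothetical, and the sketch handles plain path prefixes only, not the * and $ wildcards.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty or non-matching rule has no effect here
        is_allow = kind.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [("Allow", "/products/"), ("Disallow", "/products")]

print(is_allowed(rules, "/products/shoes"))    # True
print(is_allowed(rules, "/products"))          # False
print(is_allowed(rules, "/products-archive"))  # False
print(is_allowed(rules, "/blog"))              # True
```

Note that rule order in the file plays no part in this model, which is one reason a simple first-match parser can disagree with a search engine on the same file.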

4. Blocking CSS and JS from crawling

An old habit: Disallow: /assets/, Disallow: /static/. The idea was "Google doesn't need these". But Google renders pages with their CSS and JS to understand how the page actually looks and what it contains, including on mobile. With CSS and JS blocked, Google sees "bare" HTML, may miss content that scripts inject, and can judge the page as poorly built. Fix: open access to all static resources and block only admin and user areas (/admin, /account, /cart).
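A quick way to spot blocked static resources is to run a page's asset paths through a parser. The sketch below uses Python's standard-library urllib.robotparser; the function name blocked_assets and the sample paths are hypothetical. The standard-library parser applies the first matching rule in file order, so for files that mix Allow and Disallow its verdict can differ from Google's longest-match logic.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_paths: list[str],
                   bot: str = "Googlebot") -> list[str]:
    """Return the asset paths that the given bot may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in asset_paths if not parser.can_fetch(bot, path)]


robots = "User-agent: *\nDisallow: /assets/\nDisallow: /admin/\n"
assets = ["/assets/app.css", "/static/app.js"]

print(blocked_assets(robots, assets))  # ['/assets/app.css']
```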

5. Wrong User-agent

Sometimes you see User-agent: Google-bot or User-agent: Yandex-bot. The documented tokens are Googlebot (one word) and Yandex. Letter case is not the problem: RFC 9309 requires crawlers to match the user-agent value case-insensitively, and Google documents the same, so GoogleBot and googlebot both work. Spelling is what breaks things. If the token is misspelled, the bot finds no group of its own and falls back to the default User-agent: * block. Check: our validator runs the test per popular bot (Googlebot, YandexBot, Bingbot) and shows which rules apply to each.

6. Missing Sitemap reference

Not critical but useful. The Sitemap: https://domain.com/sitemap.xml directive in robots.txt is the explicit way to tell the engine where your sitemap lives. Without it you submit the sitemap manually in Search Console and Yandex Webmaster. With it, crawlers can discover the sitemap on their own. Verify that the directive is present, uses an absolute URL, and points to the live sitemap. Note: if you have a sitemap index with children, point to the index, not to each child individually.
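The presence check can be automated with the standard library: urllib.robotparser exposes the declared sitemaps through site_maps() (Python 3.8+). The sketch below only checks that a directive exists and that its URL is absolute; the function name sitemap_issues is hypothetical, and it does not fetch the sitemap to confirm it is live.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def sitemap_issues(robots_txt: str) -> list[str]:
    """List problems with the Sitemap directives in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    sitemaps = parser.site_maps() or []  # site_maps() returns None if absent
    if not sitemaps:
        return ["no Sitemap directive found"]
    issues = []
    for url in sitemaps:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            issues.append(f"not an absolute URL: {url}")
    return issues


good = "User-agent: *\nDisallow: /admin/\nSitemap: https://domain.com/sitemap.xml\n"
print(sitemap_issues(good))                        # []
print(sitemap_issues("Sitemap: /sitemap.xml\n"))   # ['not an absolute URL: /sitemap.xml']
```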

7. Crawl-delay for Googlebot

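The Crawl-delay directive exists, but Google ignores it. Crawl-delay: 10 under User-agent: Googlebot has zero effect. The crawl-rate setting that used to live in GSC under Settings was retired in January 2024; to slow Googlebot down temporarily, Google's documented approach is to return 500, 503, or 429 responses. Yandex announced in February 2018 that it no longer honours Crawl-delay and points site owners to the crawl rate setting in Yandex Webmaster. Fix: don't rely on Crawl-delay for Google or Yandex; use it only for bots that document support for it, such as Bingbot.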

8. robots.txt over HTTPS differs from HTTP

If your site runs on both http and https (bad on its own, but it happens during transitions), robots.txt applies per origin, so each protocol has its own file: http://domain.com/robots.txt and https://domain.com/robots.txt are fetched and applied separately. Common pattern: you updated the rules on the https version while http keeps serving the old file with Disallow: /. Fix: always redirect http to https at the server level and serve an identical robots.txt on both. Validate both versions and any divergence will be obvious.
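Once both files have been downloaded, comparing them is a one-liner with difflib. The sketch below assumes the two bodies are already in hand as strings (fetching is left out so the example stays self-contained); the function name robots_diff is hypothetical.

```python
import difflib


def robots_diff(http_body: str, https_body: str) -> list[str]:
    """Return a unified diff of the two files; an empty list means identical."""
    return list(difflib.unified_diff(
        http_body.splitlines(),
        https_body.splitlines(),
        fromfile="http://domain.com/robots.txt",
        tofile="https://domain.com/robots.txt",
        lineterm="",
    ))


old = "User-agent: *\nDisallow: /\n"
new = "User-agent: *\nDisallow: /admin/\n"

for line in robots_diff(old, new):
    print(line)
```

When fetching the http version, it is worth checking the raw response without following redirects, since a client that silently follows the redirect to https will always report identical files.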

How to automate robots.txt monitoring

A free validator is a one-off check. But robots.txt changes more often than you think: a new deploy, a new CMS plugin, a manual edit. In Site Metrics Tool we check robots.txt daily and diff against the previous version after you connect a project. If the file changed — email alert. If errors appeared — urgent alert. This turns "set once and forget" into "always under watch". Especially useful for teams of 3–5 people where someone might silently pull a staging directive into production.
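For readers who want a do-it-yourself version of change detection, the idea can be sketched with a content fingerprint. This is a hypothetical illustration, not how Site Metrics Tool works: the function names are invented, and storing the previous fingerprint and sending the alert are left to the caller.

```python
import hashlib


def fingerprint(robots_txt: str) -> str:
    """Hash the file after trimming trailing whitespace and blank edges."""
    normalized = "\n".join(
        line.rstrip() for line in robots_txt.strip().splitlines()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_body: str) -> bool:
    """True when the current file differs from the stored fingerprint."""
    return fingerprint(current_body) != previous_fingerprint


yesterday = fingerprint("User-agent: *\nDisallow: /admin/\n")

print(has_changed(yesterday, "User-agent: *  \nDisallow: /admin/\n\n"))  # False
print(has_changed(yesterday, "User-agent: *\nDisallow: /\n"))            # True
```

Normalising whitespace before hashing keeps cosmetic edits from triggering alerts, at the cost of not noticing them at all.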

Frequently asked

Can robots.txt fully remove a site from the index?

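Not immediately, and not reliably. robots.txt blocks crawling, not indexing. Without crawling, Google and Yandex can't refresh page data, and blocked pages typically fall out of results within 2–4 weeks, but a blocked URL that other sites link to can stay indexed without a description. To actually remove a page, use meta robots noindex or the X-Robots-Tag header, and keep the page crawlable: a crawler that is blocked by robots.txt never fetches the page and so never sees the noindex.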

How often to check robots.txt?

After every deploy that might touch it (migrations, CMS updates, manual edits). Plus daily automated checks via Site Metrics Tool. At minimum — monthly manual review.

What's more important — robots.txt or meta noindex?

To exclude a page from the index: meta noindex or X-Robots-Tag. To control which URLs crawlers fetch, and so where crawl budget goes: robots.txt. Different tools, different jobs.

