robots.txt looks deceptively simple: five lines, three or four directives. Because of that simplicity it gets treated carelessly: edited once a year, never tested, left to the CMS. That is a mistake. It's one of the riskiest files on a site: one wrong line can drop the whole domain out of Google and Yandex within weeks. This article walks through eight concrete mistakes we routinely catch on production sites via our free robots.txt validator, and shows how to find each one in a minute.
1. Disallow: / left over from staging
A classic. On staging you set Disallow: / so that Google doesn't accidentally index the test domain. A month later the same file ends up on production (via git, a manual transfer, or a deploy bug). The site keeps running, but Google can no longer crawl any of it. Symptom: traffic holds flat at first, then falls toward zero over the following weeks. GSC shows the "Indexed, though blocked by robots.txt" warning. Check: open /tools/robots-txt-validator and point it at domain.com/robots.txt. If the User-agent: * group contains Disallow: /, that is your alarm. Fix: remove Disallow: / and replace it with rules for specific utility paths.
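The check described above can be sketched in a few lines of Python. This is only an illustration, not the validator's implementation: the function name find_blanket_disallow and the sample files are made up for the example, and the parser handles plain groups only (it does not catch equivalents such as Disallow: /*).

```python
def find_blanket_disallow(robots_txt: str) -> list[str]:
    """Return the user-agent tokens whose group contains a bare 'Disallow: /'."""
    flagged: list[str] = []
    agents: list[str] = []   # user-agents of the group currently being read
    in_rules = False         # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:     # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                for agent in agents:
                    if agent not in flagged:
                        flagged.append(agent)
    return flagged


# Hypothetical sample files for the demonstration.
staging_file = """\
User-agent: *
Disallow: /
"""
production_file = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

print(find_blanket_disallow(staging_file))     # ['*']
print(find_blanket_disallow(production_file))  # []
```

Running a check like this in the deploy pipeline is one way to stop a staging file before it reaches production.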
2. Case-sensitive paths
Paths in robots.txt are case-sensitive. Disallow: /Admin does not block /admin, /ADMIN, or /aDmIn: crawlers treat upper and lower case as different paths. Check: take the URL you want to test and run each casing through the validator. Fix: add a directive for every variant that actually resolves, or normalise the server to a single case with a 301 redirect to lowercase.
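A quick way to see the effect locally is Python's standard-library urllib.robotparser. It is not Google's or Yandex's parser, so treat it as an approximation, but it compares paths case-sensitively in the same way. The domain and rules below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules: only the capitalised path is disallowed.
rules = """\
User-agent: *
Disallow: /Admin
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for path in ("/Admin", "/admin", "/ADMIN"):
    allowed = parser.can_fetch("Googlebot", "https://domain.com" + path)
    print(path, "allowed" if allowed else "blocked")
# /Admin blocked
# /admin allowed
# /ADMIN allowed
```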
3. Allow and Disallow for the same path
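Conflicting rules are a frequent headache. For example: Allow: /products/ together with Disallow: /products. Which wins? For a URL like /products/shoes, Google applies the most specific rule, meaning the longest matching path, so Allow wins. Yandex arrives at Allow as well, with slightly different logic: it sorts the rules by prefix length and applies the last one that matches. Note that /products itself matches only the Disallow rule and stays blocked. The problem: behaviour can diverge between engines and third-party parsers, which makes diagnosis harder. Our validator shows both outcomes, what Googlebot sees versus what YandexBot sees, in one table. Fix: write rules unambiguously and don't lean on "longer-match priority".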
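Longest-match precedence is easy to get wrong in your head, so here is a minimal sketch of it. It assumes plain path prefixes only (no * or $ wildcards) and resolves equal-length ties in favour of Allow; the function name is_allowed is invented for the example, and real crawlers implement more than this.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, prefix) pairs.

    directive is "allow" or "disallow"; prefixes are plain path prefixes.
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rule
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length the Allow rule wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [("allow", "/products/"), ("disallow", "/products")]
for path in ("/products/shoes", "/products", "/products-archive"):
    print(path, is_allowed(rules, path))
# /products/shoes True
# /products False
# /products-archive False
```

The last line shows a side effect that is easy to miss: Disallow: /products without a trailing slash also blocks unrelated paths that merely start with the same characters.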
4. Blocking CSS and JS from crawling
An old habit: Disallow: /assets/, Disallow: /static/. The idea was "Google doesn't need these". But Google renders pages, and it needs the CSS and JS to understand how the page actually looks: layout, mobile friendliness, content added to the DOM by scripts. With CSS blocked, Google sees "bare" HTML and may judge the page as poorly built. Fix: open access to all static resources and block only admin and user areas (/admin, /account, /cart).
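To check whether a page's own resources are affected, you can feed its CSS and JS URLs through a parser. The sketch below uses Python's standard urllib.robotparser, which applies the first matching rule rather than the longest one; for a Disallow-only file like this sample the result is the same. The helper name blocked_assets and all URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls: list[str],
                   bot: str = "Googlebot") -> list[str]:
    """Return the asset URLs that the given bot may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(bot, url)]


# Placeholder file showing the old habit of hiding static directories.
old_habit = """\
User-agent: *
Disallow: /assets/
Disallow: /static/
Disallow: /admin
"""
page_resources = [
    "https://domain.com/assets/app.css",
    "https://domain.com/static/bundle.js",
    "https://domain.com/images/logo.png",
]

for url in blocked_assets(old_habit, page_resources):
    print("blocked:", url)
# blocked: https://domain.com/assets/app.css
# blocked: https://domain.com/static/bundle.js
```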
5. Wrong User-agent
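Sometimes you see User-agent: GoogleBot or User-agent: yandex (lowercase). The canonical forms are User-agent: Googlebot (one word, capital G) and User-agent: Yandex (capital Y). Casing alone is not fatal for the major engines: RFC 9309 and Google's documentation specify case-insensitive matching of the user-agent token. Spelling is what breaks things. A token such as Google-bot matches no crawler, so the bot ignores that group and applies the default User-agent: * block. Stick to the canonical spelling anyway, since less careful parsers may compare case-sensitively. Check: our validator runs the test per popular bot (Googlebot, YandexBot, Bingbot) and shows which rules apply to each.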
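A simple lint for this is to list every User-agent token in the file and flag the ones you don't recognise. In the sketch below the KNOWN_TOKENS set is an assumption you would maintain yourself, and the comparison is lowercased because RFC 9309 specifies case-insensitive matching of the token, so the lint catches misspellings rather than casing.

```python
# Assumed allow-list of tokens for this example; extend it for your own bots.
KNOWN_TOKENS = {"*", "googlebot", "googlebot-image", "yandex", "yandexbot", "bingbot"}


def unknown_user_agents(robots_txt: str, known: set[str] = KNOWN_TOKENS) -> list[str]:
    """Return User-agent values that are not in the known set (case-insensitive)."""
    unknown: list[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent":
            token = value.strip()
            if token.lower() not in known:
                unknown.append(token)
    return unknown


# Placeholder file: one odd casing, one real misspelling.
sample = """\
User-agent: GoogleBot
Disallow: /admin

User-agent: Google-bot
Disallow: /cart

User-agent: yandex
Disallow: /account
"""

print(unknown_user_agents(sample))  # ['Google-bot']
```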
6. Missing Sitemap reference
Not critical but useful. The Sitemap: https://domain.com/sitemap.xml directive in robots.txt is the explicit way to tell engines where your sitemap lives. Without it you have to submit the sitemap manually in Search Console and Yandex Webmaster; with it, crawlers discover the sitemap on their own. Verify that the directive is present, uses an absolute URL, and points to the live sitemap. Note: if you have a sitemap index with child sitemaps, point to the index, not to each child individually.
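Checking for the directive is a one-liner with Python's standard library: RobotFileParser.site_maps() (available from Python 3.8) returns the declared sitemap URLs, or None when there are none. The wrapper name declared_sitemaps and the sample files are placeholders; confirming that the sitemap URL is actually live would need a separate HTTP request.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in robots.txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when absent


with_sitemap = """\
User-agent: *
Disallow: /admin
Sitemap: https://domain.com/sitemap.xml
"""
without_sitemap = """\
User-agent: *
Disallow: /admin
"""

print(declared_sitemaps(with_sitemap))     # ['https://domain.com/sitemap.xml']
print(declared_sitemaps(without_sitemap))  # []
```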
7. Crawl-delay for Googlebot
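The Crawl-delay directive exists, but Google ignores it. Crawl-delay: 10 under User-agent: Googlebot has zero effect. The crawl-rate setting that used to live in GSC → Settings → Crawl rate was retired in January 2024; Google now adjusts crawl rate automatically, and the documented way to slow Googlebot temporarily is to return 500, 503, or 429 responses. Yandex honoured Crawl-delay in the past, but its documentation now states that the directive is ignored in favour of the crawl-rate setting in Webmaster. Fix: don't rely on Crawl-delay for Google or Yandex; use it only for bots that still honour it, such as Bingbot and minor crawlers.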
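To find a Crawl-delay that sits in a Googlebot group and therefore does nothing, you can query the parsed file per bot. This sketch uses RobotFileParser.crawl_delay() from Python's standard library (Python 3.6+); the sample file is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file: a delay that only the Googlebot group declares.
robots_txt = """\
User-agent: Googlebot
Crawl-delay: 10
Disallow: /admin

User-agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

googlebot_delay = parser.crawl_delay("Googlebot")  # value from the Googlebot group
default_delay = parser.crawl_delay("Bingbot")      # falls back to the * group

if googlebot_delay is not None:
    print(f"Crawl-delay: {googlebot_delay} is set for Googlebot and will be ignored")
print("delay for other bots:", default_delay)
# Crawl-delay: 10 is set for Googlebot and will be ignored
# delay for other bots: None
```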
8. robots.txt over HTTPS differs from HTTP
If your site answers on both http and https (bad in itself, but it happens during transitions), robots.txt applies per protocol. http://domain.com/robots.txt and https://domain.com/robots.txt are separate files as far as crawlers are concerned. Common pattern: you updated the rules on the https version while http keeps serving the old file with Disallow: /. Fix: always redirect http to https at the server level and serve an identical robots.txt. Validate both versions and any divergence will be obvious.
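Comparing the two versions can be scripted with the standard library. In this sketch fetch_robots and robots_diff are illustrative names; the demo at the bottom runs on inline sample text so that it works without network access.

```python
import difflib
from urllib.request import urlopen


def fetch_robots(scheme: str, host: str, timeout: float = 10.0) -> str:
    """Download /robots.txt for one scheme. Needs network access."""
    with urlopen(f"{scheme}://{host}/robots.txt", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def robots_diff(http_text: str, https_text: str) -> list[str]:
    """Return unified-diff lines; an empty list means the files are identical."""
    return list(difflib.unified_diff(
        http_text.splitlines(), https_text.splitlines(),
        fromfile="http", tofile="https", lineterm=""))


# Inline samples standing in for the two downloaded files.
http_version = "User-agent: *\nDisallow: /\n"
https_version = "User-agent: *\nDisallow: /admin\n"

for line in robots_diff(http_version, https_version):
    print(line)
```

Against a real site you would call robots_diff(fetch_robots("http", host), fetch_robots("https", host)). Keep in mind that urlopen follows redirects, so an empty diff can also mean that http already redirects to https, which is the desired state anyway.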
How to automate robots.txt monitoring
A free validator is a one-off check. But robots.txt changes more often than you think: a new deploy, a new CMS plugin, a manual edit. Once you connect a project, Site Metrics Tool checks robots.txt daily and diffs it against the previous version. If the file changed, you get an email alert. If errors appeared, you get an urgent alert. This turns "set once and forget" into "always under watch". It is especially useful for teams of 3–5 people, where someone might silently pull a staging directive into production.
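If you want to roll a basic version of such a check yourself, the core is a comparison of two snapshots plus a severity rule. The sketch below is purely illustrative and says nothing about how Site Metrics Tool works internally; classify_change and its three return values are invented for the example, and the only "error" it knows about is a newly added blanket Disallow: /.

```python
def _has_blanket_disallow(text: str) -> bool:
    """True if any line, ignoring comments and whitespace, is 'Disallow: /'."""
    for raw in text.splitlines():
        line = "".join(raw.split("#", 1)[0].split()).lower()
        if line == "disallow:/":
            return True
    return False


def classify_change(previous: str, current: str) -> str:
    """Return 'unchanged', 'changed', or 'urgent' for two robots.txt snapshots."""
    if previous.strip() == current.strip():
        return "unchanged"
    # A blanket disallow that was not there yesterday deserves the loud alert.
    if _has_blanket_disallow(current) and not _has_blanket_disallow(previous):
        return "urgent"
    return "changed"


yesterday = "User-agent: *\nDisallow: /admin\n"
plugin_edit = "User-agent: *\nDisallow: /admin\nDisallow: /tmp\n"
staging_leak = "User-agent: *\nDisallow: /\n"

print(classify_change(yesterday, yesterday))     # unchanged
print(classify_change(yesterday, plugin_edit))   # changed
print(classify_change(yesterday, staging_leak))  # urgent
```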
Frequently asked
Can robots.txt fully remove a site from the index?
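Not immediately, and not reliably. robots.txt blocks crawling, not indexing. Without crawling, Google and Yandex can't refresh page data, and most pages gradually drop out of results within 2–4 weeks. URLs that other sites link to can still stay in the index without a snippet. To actually remove a page, use meta robots noindex or the X-Robots-Tag header, and leave the page crawlable: a bot that is blocked by robots.txt never gets to see the noindex.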
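Whether a page really carries a noindex signal can be verified from its response headers and HTML. The sketch below is a simplified check: has_noindex is an invented helper, it looks only at the generic robots meta tag and a plain X-Robots-Tag value, and it ignores bot-specific forms such as "googlebot: noindex".

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """True if the X-Robots-Tag header or the robots meta tag contains noindex."""
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in header_directives or "noindex" in finder.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex({}, page))                                   # True
print(has_noindex({"X-Robots-Tag": "noindex"}, "<html></html>"))  # True
print(has_noindex({}, "<html><head></head></html>"))           # False
```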
How often to check robots.txt?
After every deploy that might touch it (migrations, CMS updates, manual edits). Plus daily automated checks via Site Metrics Tool. At a minimum, review it manually once a month.
What's more important — robots.txt or meta noindex?
To exclude a page from the index: meta noindex or X-Robots-Tag. To control which parts of the site crawlers spend their crawl budget on: robots.txt. Different tools, different jobs.