Why your robots.txt matters more than you think

Even small websites benefit from a simple, correct robots.txt file. It helps search engines understand your structure, reduces crawl noise, and keeps low-value clutter out of the crawl.

What robots.txt actually does

A robots.txt file is a plain text file at the root of your site, usually at https://example.com/robots.txt. It tells search engines and other well-behaved crawlers which parts of your site they should or should not crawl.

It is not a security feature and it is not a privacy tool. It is a hint for bots that choose to respect it. Good search engines listen. Attackers and scrapers usually ignore it.

When robots.txt is missing or incorrect, search engines guess how to crawl your site. They do not always guess well.
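
At its simplest, the file pairs a User-agent line naming a crawler with one or more path rules. A two-line file that asks every crawler to stay out of a single folder (the folder name here is only an illustration) looks like this:

User-agent: *
Disallow: /drafts/

The * matches any crawler, and each Disallow line names a path prefix to skip.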

Why small organisations forget about it

Many small sites are built quickly and then updated over the years by different people. The original builder might have added a basic robots.txt, or none at all. Hosting panels and CMS plugins sometimes auto-generate one without explaining what it does.

The result is a file that nobody owns. It sits there for years, copied between versions of the site, quietly shaping how search engines and other bots see you.

Common mistakes that cause real problems

1. No robots.txt at all

If you have no robots.txt, search engines can still crawl your site, but they do it without guidance. Important pages may be found later than they should be, and crawl effort may be wasted on clutter.
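
If you would rather make "crawl everything" explicit instead of implied, a deliberately permissive file does the job. An empty Disallow line blocks nothing:

User-agent: *
Disallow:

That removes the guesswork without restricting anything.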

2. Accidentally blocking the whole site

A single rule can hide your entire website from search engines:

User-agent: *
Disallow: /

This is sometimes left over from a staging version of the site and forgotten during launch. Everything looks fine to staff, but nothing appears in search results.

3. Using robots.txt as a security tool

Some sites list admin, backup, or config folders inside robots.txt hoping to hide them:

Disallow: /admin/
Disallow: /backups/
Disallow: /config/

This does not hide anything. It advertises interesting folders. Search engines honour the rule, but attackers and scanners do not.

4. Blocking CSS or JavaScript by accident

Old templates sometimes block folders like /assets/, /js/, or /css/. Modern search engines fetch these files to understand how pages render on phones. If you block them, your pages can be misclassified as broken or not mobile friendly.
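
As a before-and-after sketch (the folder names are examples), the legacy pattern and its fix look like this:

# Legacy pattern: hides rendering assets from search engines
User-agent: *
Disallow: /css/
Disallow: /js/

# Fix: drop those lines, or re-open the assets if a broader rule must stay
User-agent: *
Allow: /css/
Allow: /js/

If a rule blocks files that your pages need to render, removing the rule is almost always the right call.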

5. Missing sitemap entry

Adding a sitemap line makes crawling more reliable and helps search engines find new content:

Sitemap: https://example.com/sitemap.xml

Without it, crawlers have to discover pages through internal links alone, which can be slow if your internal linking is sparse or has gaps.
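
If you have never seen one, a minimal sitemap.xml is just a list of page addresses in a fixed XML wrapper (the URLs here are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services/</loc></url>
</urlset>

Most CMS platforms generate this file for you; the Sitemap line in robots.txt simply tells crawlers where it lives.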

A simple robots.txt that works for most small sites

For a typical small site or WordPress install, something like this is usually enough:

User-agent: *
Allow: /

# Keep clutter out of the index
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

Adjust the folders to match your reality. The aim is to keep low value or purely technical paths out of search results while leaving real content open.

Remember: this is not a fence. It is a hint. If a folder genuinely contains sensitive material, it needs proper access controls, not a Disallow line.

Where robots.txt fits with bot traffic and Cloudflare

Robots.txt is only one part of how bots interact with your site:

  • Robots.txt guides search engines and other polite crawlers.
  • Cloudflare and similar tools control which requests reach your origin at all.
  • Firewall rules and rate limits handle the worst behaviour.
  • Logs and analytics show you which patterns are worth acting on.

A healthy setup uses robots.txt for guidance and tools like Cloudflare for enforcement. Trying to do enforcement inside robots.txt is a dead end.
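
As a concrete sketch of that split, an edge rule can refuse traffic that robots.txt could only politely ask to leave. Assuming Cloudflare's documented rule fields http.request.uri.path and cf.client.bot (the flag for verified crawlers), a custom rule expression might look like:

(http.request.uri.path contains "/wp-login.php" and not cf.client.bot)

paired with a Block or Managed Challenge action. Verified search engines pass through; unverified bots never reach your origin.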

If you want to go deeper into this, see Cloudflare basics for small organisations and how evidence-grade logs change the outcome of a dispute.

How to add or fix robots.txt in practice

The exact steps depend on how your site is hosted, but the pattern is the same:

  • Create a plain text file called robots.txt.
  • Place it in the web root, next to your main index file.
  • Paste in a simple template adjusted for your folders and sitemap URL.
  • Visit https://yourdomain/robots.txt in a browser to confirm it loads.
  • Use search engine tools such as Google Search Console to test how they read it.
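
If you would rather script the last two checks, Python's standard library ships a robots.txt parser. This sketch, with your own domain and paths swapped in, fetches a live file and reports what it allows:

# check_robots.py - sanity-check a live robots.txt
from urllib.robotparser import RobotFileParser

robots_url = "https://example.com/robots.txt"  # your domain here
rp = RobotFileParser(robots_url)
rp.read()  # fetch and parse the live file

# Paths you expect to be open or blocked
for path in ["/", "/blog/", "/wp-admin/"]:
    allowed = rp.can_fetch("*", "https://example.com" + path)
    print(path, "crawlable" if allowed else "blocked")

# Sitemap lines, if any (site_maps() needs Python 3.8+)
print("Sitemaps:", rp.site_maps())

It will not replace Search Console, but it catches the embarrassing cases, such as a leftover Disallow: / from staging.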

If you use a CMS plugin that manages robots.txt for you, make sure you understand what it generates. It is better to have one clear source of truth than several plugins fighting over the same file.

In plain English

  • Robots.txt is a guide for search engines, not a lock on your site.
  • A missing or incorrect file can make important pages harder to find in search.
  • Listing secret folders in robots.txt does not hide them; it highlights them.
  • A clean, simple file plus a correct sitemap entry is enough for most small organisations.

Common questions about robots.txt

Do I need a robots.txt file if my site is very small?

Yes. Even a small site benefits from a basic robots.txt that confirms the site is open to crawling and points search engines to your sitemap. It takes minutes to add and removes guesswork.

Can I use robots.txt to block bad bots?

Not effectively. Bad bots ignore robots.txt. To slow them down you need firewall rules, bot rules, and sensible rate limits at the edge, for example through Cloudflare.

Is it safe to list admin or backup folders in robots.txt?

No. It is better to protect those folders with authentication and network controls. If you list them in robots.txt, you simply advertise where the interesting things live.

What happens if I get robots.txt wrong?

The worst case is that you hide your entire site or important sections from search engines. That is why it is worth checking the file in a browser and using search engine testing tools after you change it.

Who should own robots.txt in our organisation?

Someone needs to be named as the owner, even if it is only part of their role. Ideally this is the same person who owns your sitemap, basic SEO decisions, and any Cloudflare or hosting controls.

If this page has you thinking about wider bot traffic and crawling concerns, keep an eye on the bot traffic and crawling hub, which will grow with more patterns for real world sites.

Next steps if you want help

If you would rather not wrestle with robots.txt, sitemaps, and bot rules yourself, this can sit inside a wider foundations review. That covers domains, DNS, Cloudflare, logging, and the basics of how search engines actually see you.

  • Request a short foundations call
  • See consulting options
  • Check accessible pricing