Robots.txt is an interesting file. For years it has been exploited by hackers, pentesting tool writers, crawlers, web scrapers, SEOs, etc. The other day I got to thinking about what sort of data robots.txt files store today. It's been a good few years since I last looked at any examples, so I decided to do some digging. I downloaded a few top-1M domain lists, post-processed them to only look at TLDs (and SLDs where necessary), and kicked off a lengthy process of downloading it all.
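For the curious, the fetch step doesn't need to be much more than the sketch below. It's a minimal, illustrative version only: the input file name (top1m.txt), the output layout, and the User-Agent string are placeholders, not what my actual run used.

```python
# Rough sketch of the bulk fetch step, assuming a one-domain-per-line input file.
import concurrent.futures
import pathlib

import requests

OUT_DIR = pathlib.Path("robots_dumps")
OUT_DIR.mkdir(exist_ok=True)

def fetch(domain: str) -> None:
    """Download https://<domain>/robots.txt and save whatever comes back."""
    try:
        resp = requests.get(
            f"https://{domain}/robots.txt",
            timeout=10,
            headers={"User-Agent": "robots-txt-survey/0.1"},  # placeholder UA
            allow_redirects=True,  # many sites redirect; keep the junk for later analysis
        )
        (OUT_DIR / f"{domain}.txt").write_bytes(resp.content)
    except requests.RequestException:
        pass  # dead domains, TLS errors, timeouts -- a surprisingly large bucket

if __name__ == "__main__":
    domains = [d.strip() for d in pathlib.Path("top1m.txt").read_text().splitlines() if d.strip()]
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        pool.map(fetch, domains)
```

Saving the raw bytes (rather than decoded text) turns out to matter, since a lot of what comes back is not a well-formed, ASCII robots.txt at all.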
At the time of writing it’s still running, but I already have some interesting findings to report from the first 100K domains:
- The number of domains that went dead over the last few years is crazy. Many domains once listed in the Alexa 1M are no longer there today.
- The number of websites that don’t use robots.txt is staggering. I was really shocked by how many don’t use it at all. One can argue that it’s not necessary, but if you can use it to manage legitimate crawlers… why not?
- The number of domains that redirect to some random stuff when robots.txt is requested is yet another phenomenon. As a result, many of the downloaded files are just junk HTML pages (see the triage sketch after this list).
- The number of sites that include server-side programming snippets in the output is also very interesting; you can literally see PHP code inside the downloaded pages. Not good security hygiene right there.
- Interestingly, some of the leaked snippets inject links to certain sites ONLY when the user-agent is googlebot; this looks very much like malicious SEO (cloaking) tactics at work.
- The number of sites that are most-likely-pwned is also surprising. Apart from the aforementioned malicious SEO snippets, browsing through the downloaded HTML pages reveals many instances of the very same ASCII art hidden inside comments on many unrelated sites; it could be just simple hacktivism or vandalism, but it got there somehow.
- The commented-out entries, error messages, and entries clearly introduced using a web editor (they contain HTML tags) are an interesting read too.
- The length of some of these files (listing hundreds or thousands of entries) shows that authors don’t know what wildcards are (a single rule like Disallow: /*.pdf$ could replace hundreds of per-file lines), or what the purpose of robots.txt really is 🙂
- The file is served in various encodings: ASCII, UTF-8, UTF-16 – even though the semi-official agreement is that it should be either ASCII or UTF-8.
- Another localization fun fact: many robots.txt files served as plain ASCII actually include non-ASCII characters, e.g. in German, French, Russian, or Chinese 🙂
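To get a feel for the scale of these oddities, a triage pass along these lines is enough. Again, this is only a sketch: the heuristics, thresholds, and the robots_dumps/ layout are assumptions for illustration, not my actual tooling.

```python
# Minimal triage pass over the downloaded files, assuming the fetch step above
# saved one raw file per domain under robots_dumps/.
import pathlib

for path in pathlib.Path("robots_dumps").glob("*.txt"):
    raw = path.read_bytes()

    try:
        text = raw.decode("utf-8")
        encoding = "ascii" if raw.isascii() else "utf-8"
    except UnicodeDecodeError:
        text = raw.decode("utf-8", errors="replace")
        encoding = "other (UTF-16? legacy codepage?)"

    lowered = text.lower()
    flags = []
    if "<html" in lowered or "<!doctype" in lowered:
        flags.append("junk HTML (redirect/error page)")   # not a robots.txt at all
    if "<?php" in lowered:
        flags.append("leaked server-side code")           # PHP visible in the response
    if "googlebot" in lowered and "<a href" in lowered:
        flags.append("possible cloaking/SEO injection")   # links served to crawlers only
    if len(text.splitlines()) > 1000:
        flags.append("suspiciously long (no wildcards?)")

    if flags or encoding != "ascii":
        print(f"{path.name}: encoding={encoding} flags={flags}")
```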
Frankly… it’s quite a mess. One that only a human could make 😉