when i wrote about being interested in robots.txt files, that was only half of the story. what i am really interested in are sitemap files: the reason for this is that i am interested in figuring out how to better use site structure metadata on the web. starting from the dataset of 361 robots.txt files (or almost that many, since some of them turned out to be HTML), the next question to answer is how many of those point to sitemap files.
sitemap files can be directly advertised to search engines (in which case there is no way for a third party to find out about them), or they can be referenced in a robots.txt file. the latter method makes it easy to detect sitemaps, but unfortunately it only covers those which are advertised in robots.txt files.
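just to illustrate the mechanism (the URIs here are made up), a robots.txt file advertising sitemaps simply lists them with the sitemap field, which may appear more than once and is independent of any user-agent section:

```
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-news.xml
```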
of the 361 returned robots.txt files, 38 contain the sitemap keyword, some of them multiple times, so that in the end the dataset points to 70 sitemap files. that is a little less than i was hoping for, but given the rather small initial dataset of 500 domains, maybe that's already a decent result.
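just as a sketch (and not what i actually used, which was XSLT and XML hacking), scanning the saved robots.txt files for sitemap fields could look like this in python; the directory name is made up:

```python
import os
import re

ROBOTS_DIR = "robots"  # hypothetical directory holding one saved robots.txt per domain

sitemap_uris = []
files_with_sitemaps = 0

for name in sorted(os.listdir(ROBOTS_DIR)):
    with open(os.path.join(ROBOTS_DIR, name), encoding="utf-8", errors="replace") as f:
        # the sitemap field is case-insensitive and may occur multiple times per file
        found = re.findall(r"^\s*sitemap\s*:\s*(\S+)", f.read(),
                           re.IGNORECASE | re.MULTILINE)
    if found:
        files_with_sitemaps += 1
        sitemap_uris.extend(found)

print(files_with_sitemaps, "robots.txt files advertise", len(sitemap_uris), "sitemap files")
```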
most of the advertised sitemap files (62 of them) can be retrieved successfully, and most of them are modest in size. only three are bigger than 1MB, and the biggest one (a porn site) is 2.2MB. however, these are only the measurements for the files which are directly pointed to by the robots.txt files, so this is only an intermediary step.
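continuing from the sitemap_uris list in the sketch above (again just an illustration, not my actual setup), retrieving the advertised files and looking at their sizes could be done like this:

```python
import urllib.request

retrieved = {}   # sitemap URI -> raw bytes of the retrieved file
for uri in sitemap_uris:
    try:
        with urllib.request.urlopen(uri, timeout=30) as response:
            retrieved[uri] = response.read()
    except Exception as error:
        print("could not retrieve", uri, ":", error)

# the biggest advertised sitemap files, in MB
for uri, data in sorted(retrieved.items(), key=lambda item: len(item[1]), reverse=True)[:5]:
    print(uri, round(len(data) / (1024 * 1024), 2), "MB")
```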
sitemap files can directly contain sets of URIs, or they can be index files pointing to further sitemap files. of the 62 retrieved files pointed to by the robots.txt files, 35 are index files, and only 27 directly contain URI sets. the 35 index files point to 14'056 sitemap files (each of which will probably contain a large set of URIs), so this is where the explosion of a few popular sites into many URIs finally happens. specifically, of those 14'056 sitemap files, 12'873 are specified by amazon (which is represented in the domain list by amazon.com, amazon.co.jp, amazon.de, and amazon.co.uk). apparently, amazon is a really big fan of providing really big sitemaps (all of them are provided as gzip'ed XML), and please keep in mind that 12'873 is the number of sitemaps provided by amazon, not the number of URIs.
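telling index files apart from URI sets comes down to checking the root element of the sitemap XML (sitemapindex vs. urlset). a sketch of that check, continuing from the retrieved dictionary above and also handling the gzip'ed variant that amazon uses, could look like this (classify_sitemap is just a name i made up, not part of any standard tool):

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def classify_sitemap(data):
    """return ('index', nested sitemap URIs) or ('urlset', page URIs)."""
    if data[:2] == b"\x1f\x8b":          # gzip magic number, e.g. amazon's .xml.gz files
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    locs = [loc.text for loc in root.iter(SITEMAP_NS + "loc")]
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    return kind, locs

# count index files and URI sets, and collect the sitemaps the index files point to
nested, index_count, urlset_count = [], 0, 0
for uri, data in retrieved.items():
    kind, locs = classify_sitemap(data)
    if kind == "index":
        index_count += 1
        nested.extend(locs)
    else:
        urlset_count += 1

print(index_count, "index files pointing to", len(nested), "sitemaps;",
      urlset_count, "files with URI sets")
```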
looking at those numbers, this is definitely where the approach of quickly hacking XSLT and XML comes to an end. for future experiments, it will be necessary to use tools that are a bit better at handling large datasets than my XML editor and the file system...