when i recently started crawling for sitemaps, i somehow assumed that they would be publicly available. after all, they exist to tell search engines how to better crawl a site. however, some of them are not available at all, returning a 404 (Not Found) error. this is probably just a robots.txt and a site configuration being out of sync, so i assume most of these cases are not intentional.
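as a rough illustration of the first step of such a crawl, here is a minimal sketch of pulling the sitemap pointers out of a robots.txt body. the function name `sitemap_urls` and the example.com URLs are made up for this example, and this is not the crawler actually used here; it only assumes the common convention of one `Sitemap: <url>` line per sitemap.

```python
def sitemap_urls(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body.

    Hypothetical helper for illustration only.
    """
    urls = []
    for line in robots_txt.splitlines():
        # robots.txt comments start with '#'; drop them before parsing
        line = line.split("#", 1)[0].strip()
        # split at the first ':' only, so the ':' in 'http://' is preserved
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


sample = """User-agent: *
Disallow: /private
Sitemap: http://example.com/sitemap.xml
sitemap: http://example.com/other.xml.gz  # field names are case-insensitive
"""
print(sitemap_urls(sample))
```

for the sample above this prints the two example.com URLs in order. python's standard library also offers `urllib.robotparser.RobotFileParser.site_maps()` (python 3.8 and later) for the same purpose; the hand-written version is shown only to make the parsing explicit.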
on the other hand, myspace's robots.txt points to http://www.myspace.com/us_sitemap_index.xml.gz, which returns a 401 (Unauthorized) response. this is a bit more puzzling, since it looks intentional. HTTP has no way of communicating how to authenticate for this resource, so it is unclear if and how authentication via HTTP would be possible. (it is also possible that the only accepted form of authentication is an originating IP address belonging to a well-known partner.)
does it make sense to access control sitemaps? maybe it does, because a site may not want to give away an almost complete list of all of its URIs for free. in particular for social networking sites, whose main value lies in the data they accumulate, it might make sense to not publish all user IDs via the sitemap. let's do a quick hypothesis check:
- facebook's robots.txt points to http://www.facebook.com/sitemap.php, which redirects to facebook's home page via a 302 (Found) response. that could be a misconfiguration of the server, or it could be a more subtle way of access control, simply redirecting unauthorized clients to the home page instead of returning an error message.
- friendster's robots.txt does not even point to a sitemap. this is safe, but may cause the site to be less accurately updated by some search engines.
- MSN Spaces' robots.txt points to http://spaces.live.com/sitemapindex.xml (and this is surprisingly the only line in the robots.txt), which is just an index file. the index contains a large number of links to sitemap files; http://spaces.live.com/SiteMap_20081006_041012_0_0.xml is the first one and returns a 200 (OK) and a small list of URIs to user profiles. however, the index contains URIs of 7061 sitemap files, and maybe some of these are access controlled.
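the manual checks above could be automated roughly as follows. this is only a sketch under stated assumptions: the names `classify` and `probe` are hypothetical, the category labels are my own shorthand for the cases seen above, and treating 403 and 410 like 401 and 404 is an assumption that goes beyond the examples. redirects are deliberately not followed, so that a 302 to the home page shows up as a 302 rather than as the 200 of the page it leads to.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # returning None makes urllib surface the 3xx as an HTTPError
        return None


def classify(status):
    """Map an HTTP status code to a rough sitemap availability label."""
    if status == 200:
        return "available"
    if status in (301, 302, 303, 307, 308):
        # could be misconfiguration or a subtle form of access control
        return "redirected"
    if status in (401, 403):  # 403 is an assumption, not seen above
        return "access controlled"
    if status in (404, 410):  # 410 is an assumption, not seen above
        return "missing"
    return "other"


def probe(url, timeout=10.0):
    """Fetch url without following redirects; return (status, label)."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # 3xx, 4xx and 5xx responses all arrive here
        status = err.code
    return status, classify(status)


# the three outcomes observed above, without touching the network
for code in (401, 302, 200):
    print(code, classify(code))
```

calling `probe()` on each URL taken from a sitemap index would give a first impression of how many of the listed sitemaps are actually reachable, though a 200 alone says nothing about whether the content differs by client.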
it would be interesting to learn a little more about how sitemaps are published and controlled on the web, and we are currently collecting some data about this (in a more systematic way than the examples presented here). maybe MSN Spaces' openness about its profiles is simply an indication that Microsoft is still getting used to this whole web thing, whereas the younger companies have a better understanding of how the web works, and how to work with it.