now that it is becoming increasingly apparent that Atom will succeed as a general-purpose format, it also becomes more interesting to think about the world of feeds. there is a very rich body of research around web metrics and web characterization. this is messy on the web, because sampling and categorizing web pages is hard. but at least web pages are typically well-connected (not as well as hypertext enthusiasts might wish, but still well enough that the web can be crawled by simply following links), even though choosing a good seed set of pages to start with, and good conditions for stopping a crawl because of link spam, can be a challenge.
doing the same thing for feeds would be very interesting, but more challenging, because feeds are different. feeds usually lead a rather lonely life: they are not linked in the web page sense of the word. they may be discoverable, but many feeds are not exposed via autodiscovery, only through regular links, and many feeds are not linked to at all; you simply have to know their URI to find them. for feeds which are the results of query services, the feed is a looking glass into the deep web, but probably one that shows only a very limited part of the underlying data. furthermore, feeds change a lot, so you definitely don't want just a snapshot of a feed, but some analysis over time.
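autodiscovery, where publishers do use it, works through `<link>` elements in a page's HTML head. a minimal sketch of how a crawler might pick those up could look like this (the page URL and HTML snippet are made-up examples, not real data):

```python
# sketch: extract autodiscoverable feed URIs from an HTML page by
# scanning <link rel="alternate"> elements with a feed media type.
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/atom+xml", "application/rss+xml"}

class FeedLinkParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []  # absolute URIs of discovered feeds

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        if "alternate" in rels and (a.get("type") or "").lower() in FEED_TYPES:
            # resolve the (possibly relative) href against the page URL
            self.feeds.append(urljoin(self.base_url, a.get("href") or ""))

# made-up example page
page = """<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed.atom">
<link rel="stylesheet" href="/style.css">
</head><body>...</body></html>"""

parser = FeedLinkParser("http://example.com/")
parser.feed(page)
print(parser.feeds)  # ['http://example.com/feed.atom']
```

of course this only finds the feeds that are autodiscoverable, which, as said above, is exactly the limitation: feeds reachable only through plain links or known-URI query services would never show up this way.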
even if there were a large set of feed URIs available, it would be difficult to decide what a representative sample is. for example, large services such as flickr or twitter routinely publish feeds, so these platforms alone host millions of feeds. but all of these feeds look identical (apart from the content, of course), because they are all generated by the same application. the interesting thing, however, would be to find out what typical feeds on the web look like (and please don't ask me to define typical). questions that would be interesting to answer would be statistics for the following things:
- feed formats (RSS versions and Atom)
- use of feed extensions (podcasts, GeoRSS, paging and archiving, threading, licensing)
- feed size (# entries, size of entries, size of feed)
- usage of feed/entry features (statistics about element/attribute usage and values)
- feed update statistics (new entries per hour/day/week, distribution of updates)
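the first few items on that list could be computed from the feed documents themselves. a rough sketch of such per-feed statistics, assuming nothing more than XML parsing (the feed below is a made-up example), might be:

```python
# sketch: classify a feed document as Atom or RSS (with version) and
# count its entries/items, as a starting point for feed statistics.
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def feed_stats(feed_xml):
    root = ET.fromstring(feed_xml)
    if root.tag == f"{{{ATOM_NS}}}feed":
        fmt = "atom"
        entries = root.findall(f"{{{ATOM_NS}}}entry")
    elif root.tag == "rss":
        fmt = "rss " + root.get("version", "?")
        entries = root.findall("./channel/item")
    else:
        fmt, entries = "unknown", []
    return {"format": fmt, "entries": len(entries), "bytes": len(feed_xml)}

# made-up example feed
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example</title>
  <entry><title>one</title></entry>
  <entry><title>two</title></entry>
</feed>"""

stats = feed_stats(feed_xml)
print(stats["format"], stats["entries"])  # atom 2
```

the harder items, extension usage and update statistics, would need per-element tallies and repeated fetches over time, which is exactly why a snapshot of a feed is not enough.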
all of this is probably a lot to ask for, but i think many people would be very interested in this data. maybe such a study has been done and i am just not aware of it? another question i am asking myself: 80legs provides a pretty interesting programming environment for web analytics, but do they also store feeds? and if they do, are those feeds updated? probably not, but 80legs might at least be a pretty useful starting point for finding those feeds that are somehow discoverable from web pages.
80legs does not store feeds right now, but we make it pretty easy for people to discover any feeds that are discoverable from web pages. We'll be very interested to see how this area develops.
Posted by: Brad Wilson | Wednesday, June 10, 2009 at 19:22