after finding what may have been Stimulus Feed #1, we have now found one more, so we simply call it #2 (which of course may be completely wrong). this second feed has been made available on the Department of Justice (DOJ)'s recovery web site. in this case, the feed is advertised through feed autodiscovery, so it can be found automatically once you've found the web site (how to distinguish between various feeds, if there were more than one, is a different issue that needs to be addressed when thinking about Stimulus Feed Details).
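as a rough illustration of what feed autodiscovery means in practice, here is a minimal sketch in Python that scans a page for link elements announcing a feed. the names FeedLinkFinder and discover_feeds are made up for this example, and the set of accepted media types is an assumption; nothing here is taken from the DOJ site itself.

```python
from html.parser import HTMLParser

# media types treated as feeds here; this set is an assumption for the sketch
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects <link rel="alternate" type="..." href="..."> feed announcements."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # rel is a space-separated list of tokens; valueless attributes are None
        rels = (a.get("rel") or "").lower().split()
        media_type = (a.get("type") or "").lower().strip()
        if "alternate" in rels and media_type in FEED_TYPES and a.get("href"):
            self.feeds.append((media_type, a["href"], a.get("title")))


def discover_feeds(html_text):
    """Return (media type, href, title) tuples for every advertised feed."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feeds
```

if a page advertises more than one feed, all of them are returned in document order, and picking the right one is left to the caller.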
the interesting observation is that this feed does things very differently (but still according to the guidelines). it is not in great shape, but the feed validator only reports two errors. more importantly, the only entry's atom:link points to a non-existing page (and does not have a @rel attribute), but if you change the development link to a guessed production link, it actually works.
the problem is that the entry (and the web page) is plain HTML. the web page is actually XHTML, but it is included in the feed as escaped HTML, which makes processing a bit harder than it would have to be. but since it is well-formed XML, it can be safely unescaped and parsed. the actual web page is almost valid XHTML, and its one minor error does not really make things hard and can be safely ignored.
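the unescape-and-parse step could look roughly like the following Python sketch. it assumes that the escaped markup sits in the entry's atom:content element and has a single root element, neither of which is confirmed here; parse_escaped_content is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_escaped_content(feed_xml):
    """Parse the first entry's escaped content as XML and return its root element.

    Assumes the content lives in atom:content (an assumption of this sketch)
    and that the unescaped markup is well-formed with one root element.
    """
    feed = ET.fromstring(feed_xml)
    content = feed.find(f"{ATOM}entry/{ATOM}content")
    if content is None or not content.text:
        return None
    # the XML parser has already turned &lt; and &gt; back into < and >,
    # so the element text is the markup itself and can be parsed again
    return ET.fromstring(content.text)
```

this only works as long as the unescaped content really is well-formed XML; HTML-only constructs such as named entities like &nbsp; would make the second parse fail.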
what makes things hard and almost useless (from the viewpoint of somebody interested in the information) is that the HTML has no semantic markup at all. here is the markup for one of the table rows (the table header row is not marked as such and only contains textual headers):
<tr>
<td valign="top">15</td>
<td valign="top">0402</td>
<td valign="top">OJP</td>
<td valign="top"> $2,765,000,000 </td>
<td valign="top"> $ </td>
<td valign="top"> $</td>
<td valign="top">1) The allocation for the Office for Victims of Crime Recovery Act Compensation and Assistance formula grant programs was finalized.</td>
<td valign="top">1) OJP intends to post its formula grant solicitations/announcements between March 6 and March 9, 2009.</td>
</tr>
this markup makes it almost impossible to extract the information from that page. compared to Stimulus Feed #1, at least no proprietary data format must be parsed in this case, but apart from that, it is hard to imagine how to reliably and automatically extract the information from it.
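to make the fragility concrete, here is a minimal Python sketch of the only extraction the markup really allows: reading cells by position. the helper names row_cells and parse_amount are hypothetical, and the idea that the fourth to sixth cells hold dollar amounts is a guess based on the sample row, since nothing in the markup says what a column means.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def row_cells(row_xml):
    """Return the trimmed text of each <td> in a well-formed <tr>, in order."""
    row = ET.fromstring(row_xml)
    return ["".join(td.itertext()).strip() for td in row.findall("td")]


def parse_amount(text):
    """Turn '$2,765,000,000' into a Decimal; a bare '$' or empty cell gives None."""
    cleaned = (text or "").replace("$", "").replace(",", "").strip()
    return Decimal(cleaned) if cleaned else None
```

any meaning has to be attached by the consumer through column positions, so a single added, removed, or reordered column silently breaks the extraction.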