after blogging about Stimulus Feeds Done Right, i received a couple of comments saying that the post was mostly complaining about the Initial Implementing Guidance for the American Recovery and Reinvestment Act, but not really saying what to improve specifically. well said, i had to admit, so here is a more concrete approach to doing stimulus feeds right.
- Finding Stimulus Feeds is important, and instead of setting up a central registry (which still might be a good idea), it should be possible to discover stimulus feeds in a predictable way. page 58 requires all agencies to set up recovery web pages at http://agency.gov/recovery, and there are a number of requirements for these pages. feed autodiscovery should be added as a requirement for this web page, so that going to an agency's http://agency.gov/recovery page and searching for a linked feed is a reliable method for discovering an agency's stimulus feed (a sketch of this is shown after the list).
- the guidelines seem to require three feeds: the major communications feed, the formula block grant allocation feed, and the weekly report feed. it is not quite clear, though, whether agencies are free to have just one feed carrying three types of entries, or whether three separate feeds are required. from an information dissemination point of view, it would probably be better to require three separate feeds (and maybe provide a fourth one aggregating all three). in that case, feed discovery would have to be more specific and make sure that all three feeds can be discovered from an agency's recovery web page.
- a feed can contain the information (in what format? more on that later) or can point to a file containing the information (in what format? more on that later). if the feed points to a file, how is it supposed to do this? it could do so via atom:link/@href, atom:content/@src, atom:content[@type='xhtml']//html:a/@href, or, god forbid, atom:content[@type='html']//html:a/@href. ideally, links should not be allowed at all, but if they are, they should be required to use atom:content/@src (a sketch of following such links is shown after the list).
- agencies are allowed to provide no feeds at all; they can just publish the files via predefined URI structures. however, since it is unclear if and how a web server will make the directory of such a set of files available, it will be impossible to reliably discover and retrieve files with such an approach. at the very least, agencies incapable of producing feeds should be required to provide machine-readable directories on all levels of all posted files at the suggested URIs. there should be a required and machine-readable format for these directories.
- the guidelines specify the feed format as "preferred: Atom 1.0, acceptable: RSS", and do not even mention RSS versions. this means that feeds can use 10 different formats, Atom plus the 9 different RSS variants. while it would be best to only allow Atom, it would be good to at least limit RSS to specific versions, such as RSS 2.0 (which probably should be specified to be RSS 2.01 rev 2, according to mark pilgrim's versioning).
- since the feeds are allowed to contain only links instead of the actual data (essentially turning them into a notification mechanism), there also is a file format for how to publish the information. unfortunately, the templates for this file format are not publicly available. it would be important for this file format to be easily accessible with generally available tools, which limits the choices to plain text and XML. if XML is used, it should be plain XML and not some complex format. if more complex formats such as XBRL are required, XSLT transforms for up- and down-translation between these more complex formats and a plain XML format should be provided.
- starting on page 55, the guidelines mention data elements the feeds should include. it is unclear how these should be included (if the feed points to a file, the file format probably contains those data elements, but if the feed is supposed to contain those data elements, there must be a feed-oriented syntax for them). in addition to that, the datatypes seem to be copied straight from a SQL database schema. there should be one required syntax for these data elements, and it should be based on XSD datatypes (a sketch of what that could look like is shown after the list). implementation variants for this syntax are plain XML or microformats; if microformats are chosen, RDFa should be used.
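to make the autodiscovery point a bit more concrete, here is a minimal sketch (python, standard library only) of what discovering an agency's stimulus feed from its recovery page could look like; the agency URI is just a placeholder, and the requirement would simply be that this kind of lookup works for every agency's recovery page:

    # sketch: discover an agency's stimulus feed(s) via feed autodiscovery
    # on its recovery page (the URI below is a placeholder).
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class FeedLinkFinder(HTMLParser):
        # collects <link rel="alternate" type="application/atom+xml" href="..."/> elements
        def __init__(self):
            super().__init__()
            self.found = []
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("rel") == "alternate" \
                    and a.get("type") in ("application/atom+xml", "application/rss+xml"):
                self.found.append((a.get("title"), a.get("href")))

    def discover_feeds(recovery_page_uri):
        finder = FeedLinkFinder()
        finder.feed(urlopen(recovery_page_uri).read().decode("utf-8", "replace"))
        return finder.found

    # example (placeholder URI):
    # print(discover_feeds("http://agency.gov/recovery"))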
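for the point about feeds pointing to files, a minimal sketch of a consumer following atom:content/@src links, assuming the feeds actually use that construct (the feed URI is again a placeholder):

    # sketch: follow atom:content/@src links from a stimulus feed.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    ATOM = "{http://www.w3.org/2005/Atom}"

    def linked_files(feed_uri):
        # yields (entry title, content URI, media type) for out-of-line content
        tree = ET.parse(urlopen(feed_uri))
        for entry in tree.getroot().findall(ATOM + "entry"):
            content = entry.find(ATOM + "content")
            if content is not None and content.get("src"):
                title = entry.findtext(ATOM + "title", default="")
                yield title, content.get("src"), content.get("type")

    # example (placeholder URI):
    # for title, src, mediatype in linked_files("http://agency.gov/recovery/feed.atom"):
    #     print(title, src, mediatype)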
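and for the data elements starting on page 55, a sketch of what a plain XML syntax backed by XSD datatypes could look like; all element names here are made up for illustration, since the actual templates are not publicly available:

    # sketch: one possible plain XML serialization of report data elements.
    # element names are hypothetical; each element would be declared in an
    # XSD schema with an XSD datatype (xsd:string, xsd:decimal, xsd:date)
    # instead of SQL-ish column types.
    import xml.etree.ElementTree as ET

    def report_entry(award_number, amount, report_date):
        entry = ET.Element("reportEntry")
        ET.SubElement(entry, "awardNumber").text = award_number     # xsd:string
        ET.SubElement(entry, "obligatedAmount").text = str(amount)  # xsd:decimal
        ET.SubElement(entry, "reportDate").text = report_date       # xsd:date
        return ET.tostring(entry, encoding="unicode")

    print(report_entry("ABC-123", "1500000.00", "2009-02-27"))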
this is not all there is to it (and without seeing the templates for the files it is hard to make specific recommendations for that part), but it should point in the right direction. i really hope that the current suboptimal guidelines will not make people believe that feeds necessarily result in a rather unpredictable hodgepodge of how information dissemination is organized. it actually does not take that much effort to come up with guidelines that create a robust and predictable landscape, and providing test data, test tools, and validators would help agencies to conform to those guidelines.
the good news is that the current guidelines (published 2/18/2009) specifically mention that more detailed guidance will be published within 30-60 days, which means that some of the above issues may still be addressed in that guidance. it also means that it makes sense for agencies to wait until the more detailed guidelines become available.
Hi Erik, great to find folks like you talking about what recovery.gov is trying to do with feeds. I'd like to hear more about your feed autodiscovery ideas. As it is, doesn't a simple URI convention to GET an Atom service document help get us where we want to go, even without a 'registry'? Having said that, the combination of 'guidance' and recovery.gov could perform that function; even if we can't autodiscover feeds, we'd at least know where to introspect.
All of the 'use RSS or Atom' and 'can contain or point to' is more an acknowledgment of what we expect or don't expect reporting entities to do now. We know there are lots of versions of RSS, but most folks seem to have more awareness of RSS than Atom, even if they have no idea what RSS flavor they're using.
Initial and preliminary feedback (usually not relating to the actual techniques and technologies being suggested, like yours) shows that some, if not most, would seem to want or expect recovery.gov to do all the work and be the only APP server. Others are perhaps more ready to provide their own capabilities, and ask the right questions, like you're doing, which helps us gain clarity. Ideally, the original vision was to enable 'transparency at the source', and consider the information providers the gold source of that data, so the less recovery.gov processing the better, for a variety of reasons. We may have to take the fork in the road and do both, or at least provide everything anyone would need initially. We also hope to be able to share open source reference implementations of anything we do stand up on recovery.gov.
For the former 'do it all for me because commodity standards-based web infrastructure and skills are too hard or redundant across gov and therefore bad' crowd, the idea thus far is to bind a spreadsheet and/or an XForm to XSD datatypes, such that they either put/post to their own or recovery.gov-provided APP services, which would parse/transform/publish in whatever way is most expedient. Office productivity spreadsheets and XForms can be edited offline and published to such a service, or XForms can be served online as well, submitting directly to the parse/publish/persist/view/whatever services.
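By way of illustration, here's roughly what such a put/post could look like; all URIs and element names are placeholders, nothing here is standardized yet:

    # rough sketch only: POSTing an Atom entry carrying report data to an APP
    # (Atom Publishing Protocol, RFC 5023) collection. URIs and element names
    # are placeholders; a complete entry would also carry atom:id, atom:updated, etc.
    from urllib.request import Request, urlopen

    ENTRY = """<?xml version="1.0"?>
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>Weekly report, award ABC-123</title>
      <content type="application/xml">
        <reportEntry xmlns="">
          <awardNumber>ABC-123</awardNumber>
          <obligatedAmount>1500000.00</obligatedAmount>
        </reportEntry>
      </content>
    </entry>"""

    req = Request("http://agency.gov/recovery/collection",  # placeholder collection URI
                  data=ENTRY.encode("utf-8"),
                  headers={"Content-Type": "application/atom+xml;type=entry"},
                  method="POST")
    # urlopen(req)  # the server would create a new member resource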
For the latter 'we've got the chops and the infrastructure, just tell us how you like it' folks, we're heading toward a desired XHTML+RDFa set of markup standards that would allow us to consider the web page as the web service (consumable by humans with browsers and by parsers creating triples), the published resource as the public record, and the entries of the feed resource as record state changes. That suggests a feed organization around what we're actually tracking (grants/loans/contract awards) from various reporting entities receiving stimulus funds (fed/state/local/tribal gov agencies and large/medium/small businesses!) throughout a stabilization/stimulus/recovery/growth lifecycle with cost/schedule/performance indicators, rather than a feed organization around milestone/lifecycle centric reports, which seems to be consistent with what you are suggesting. However, there's a good bit of information architecture to get worked out here yet.
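To make that slightly more concrete, here is roughly the kind of markup (and naive triple extraction) meant by treating the page as the service; purely a sketch, with placeholder vocabulary and terms, since the information architecture isn't worked out yet:

    # purely a sketch: XHTML+RDFa style markup where humans read the page and
    # parsers pull out statements. the vocabulary URI and term names are placeholders.
    from html.parser import HTMLParser

    PAGE = """
    <div xmlns:rec="http://example.gov/recovery-terms#"
         about="http://agency.gov/recovery/awards/ABC-123">
      <span property="rec:awardNumber">ABC-123</span>
      <span property="rec:obligatedAmount" content="1500000.00">$1.5M</span>
      <span property="rec:reportDate" content="2009-02-27">Feb 27, 2009</span>
    </div>
    """

    class NaiveRDFa(HTMLParser):
        # not a real RDFa processor, just enough to show how property/content
        # attributes carry machine-readable statements alongside human-readable text
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if "property" in a:
                print(a["property"], "->", a.get("content", "(element text)"))

    NaiveRDFa().feed(PAGE)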
We'd like the XML based on the XSDs provided to be transformed to reflect an overall graph-based (RDFS probably) data model (early in development). That model begins with a large number of existing systems that we'd like to ultimately make Linked Open Data enabled SPARQL endpoints, to ease correlation across disparately owned/operated/managed graphs (other LOD/SPARQL exposed DBs), since they also represent a large number of relational schemas that must be integrated but are unlikely to be normalized without extreme coordination cost. Then there are lots of existing domain taxonomies/ontologies (like XBRL with FM concepts/terms tags, and I think SIOC is the kind of data model we want that tracks the dollar instead of the person across disparate web sites) that may be useful in building a federated graph-bridge capability without instantiating 'one ring to rule them all' (yet another) database, which is what most seem to assume or recommend that we do.
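As a rough illustration of the LOD/SPARQL direction (the endpoint URI and vocabulary terms below are placeholders, not anything that exists yet):

    # rough illustration only: querying a hypothetical SPARQL endpoint to correlate
    # awards across disparately owned/operated/managed graphs.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    ENDPOINT = "http://example.gov/recovery/sparql"  # placeholder endpoint

    QUERY = """
    PREFIX rec: <http://example.gov/recovery-terms#>
    SELECT ?award ?recipient ?amount
    WHERE {
      ?award a rec:Award ;
             rec:recipient ?recipient ;
             rec:obligatedAmount ?amount .
    }
    ORDER BY DESC(?amount)
    """

    def run_query(endpoint, query):
        # standard SPARQL protocol: GET with a query parameter, results format negotiated
        req = Request(endpoint + "?" + urlencode({"query": query}),
                      headers={"Accept": "application/sparql-results+json"})
        return urlopen(req).read()

    # print(run_query(ENDPOINT, QUERY))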
I hope this helps give more ideas about what we're thinking - as you might imagine the stakeholder set is extraordinarily large, so it's not always easy to progress in an 'agile' fashion. Feel free to reach out to me george at thomas dot name as you like, I'm sure I could learn from you, and would be grateful for that.
Posted by: George Thomas | Friday, February 27, 2009 at 12:24
@george: i just recovered your comment from the spam folder, thanks for pointing me to it, and i am sorry it ended up in there.
i think the general idea to try to stay away from "one ring to rule them all" centralized architectures is very important. the other important question is whether the overall data model should be based on RDFS or plain XML. i would argue that outside of the semantic web community, tools and know-how for working with RDF (let alone SPARQL) are not very widespread, a point bob glushko and i tried to make in our "XML Fever" article (http://dret.net/netdret/docs/wilde-cacm2008-xml-fever). personally, i like the picture you're painting here, minus the RDFS/SPARQL part. i would argue that if it can be done with plain XML, it should be done with plain XML. if it requires sophisticated ontologies and advanced reasoning, then it might require RDFS/OWL, but that would be a decision that should not be taken lightly.
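to illustrate the "generally available tools" point: getting values out of plain XML takes a couple of lines with the XML support that ships with pretty much any programming environment (the element names below are made up, since the real templates are not public):

    # plain XML can be processed with nothing but a standard library;
    # element names are made up for illustration.
    import xml.etree.ElementTree as ET

    record = ET.fromstring(
        "<reportEntry>"
        "<awardNumber>ABC-123</awardNumber>"
        "<obligatedAmount>1500000.00</obligatedAmount>"
        "</reportEntry>")

    print(record.findtext("awardNumber"), record.findtext("obligatedAmount"))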
Posted by: dret | Saturday, March 14, 2009 at 15:40