or, the alternative title for this could have been "The Killer App for Linked Data". more about the killer app in a minute; first about the not-killer-app: the hope that linked data will converge on a small set of vocabularies, and then become interlinked in the elusive Giant Global Graph.
at the Linked Data on the Web (LDOW 2011) workshop at WWW2011 this March, one of the most interesting discussions was about why there has been no wider adoption of some core vocabularies. some people suggested that the right platform for sharing vocabularies simply had not been found yet.
however, the authors of an excellent SIGMOD Record article put it like this:

"Semantic agreement comes at a cost, and that cost is driven both by the number of people who require a shared understanding, and the number of concepts they must all understand. The cost element appears to be the person-concept. We are aware of many examples of small numbers of participants agreeing on large, complex standards (e.g., meteorology) and of larger numbers of participants agreeing on modest standards (e.g., cursor on target), but we have seen few successes where large numbers of autonomous participants agreed to a large, complex standard." (page 47)
this is only an empirical observation, but it's definitely food for thought: how realistic is it to hope that vocabulary reuse is going to happen, even given a better platform for encouraging it than currently exists? is the problem a mere get-vocabularies-to-the-people problem, or one of people-usually-do-not-agree-on-vocabularies? the case in point (as usual) is HTML, which is crappy as a publishing vocabulary, but was probably successful because it is easy to understand, easy to use, takes a lot of bending without breaking, and delivers excellent value through the network effect.
now back to the title: from AI to BI? in the discussion following the problem of vocabulary adoption and the resulting data fragmentation, richard cyganiak, a very well-known linked data researcher from DERI, made a remark that surprised and delighted me. he said that most of the value of linked data is usually derived by dumping all the data into a single triple store, and then working on that. this has been the practice for years now, with all the triple-count boasting of various linked data groups being a very good indicator, but it seemed to me that in most cases this was supposed to be a temporary fix until full distribution, decentralization, and interconnectedness could be achieved. it was great to hear richard say that centralized data crunching basically is the main activity when it comes to using linked data.
what used to be called "data mining" now seems to be morphing into something a bit wider labeled "data science"; and what used to be "data warehousing" is now the exact same thing, just called "business intelligence" (BI) (listen to hal varian explaining how essential all of this already is and how much more essential it will become). what i am suggesting is that BI is the killer application for linked data and RDF. for this to happen, RDF needs to shed all of its semantics/reasoning bulk, something that could be done with the ongoing work around a revised RDF but, as richard remarks, does not seem to be happening (yet). this activity may be an important inflection point in RDF's history: become more powerful and sophisticated and even less accessible to the mainstream and the needs of average data management, or radically simplify and become the foundation for something that's accessible and usable for many people and applications.
imagine average BI providers understanding and applying linked data principles: it would work pretty much exactly the way the popular linked data cloud is being produced today. scrape existing sources with often not-so-pretty back-end processes, sprinkle the individual parts with a bit of provenance info, dump everything into a triple store, and start SPARQLing away. refine that to be real-time and push-enabled all the way through, and you have the holy grail of BI: real-time BI!
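to make that concrete, here is a minimal sketch of such a pipeline using rdflib; the ex: namespace, the source URLs, and the two order records are made-up placeholders, and the in-memory graph merely stands in for whatever triple store would actually be used:

```python
# minimal sketch of the scrape -> provenance -> triple store -> SPARQL pipeline;
# the ex: namespace, source URLs, and order records are made up for illustration,
# and the in-memory rdflib Graph stands in for a real triple store.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("http://example.org/bi/")   # hypothetical company namespace
g = Graph()                                # stand-in for the central triple store

# pretend these rows fell out of a not-so-pretty back-end scraping process
scraped_rows = [
    {"order": "o-1001", "amount": "250.00", "region": "EMEA",
     "source": "http://crm.example.org/export"},
    {"order": "o-1002", "amount": "990.50", "region": "APAC",
     "source": "http://erp.example.org/dump"},
]

for row in scraped_rows:
    order = EX[row["order"]]
    g.add((order, EX.amount, Literal(row["amount"], datatype=XSD.decimal)))
    g.add((order, EX.region, Literal(row["region"])))
    # sprinkle the individual parts with a bit of provenance info
    g.add((order, DCTERMS.source, URIRef(row["source"])))

# ... and start SPARQLing away: total order volume per region
for region, total in g.query("""
    PREFIX ex: <http://example.org/bi/>
    SELECT ?region (SUM(?amount) AS ?total)
    WHERE { ?order ex:amount ?amount ; ex:region ?region . }
    GROUP BY ?region
"""):
    print(region, total)
```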
the beauty of this idea is twofold. first, and maybe most importantly, there is a ton of interest and money in BI; that matters a lot. second, since BI is not globally distributed but centralized, there is no need to worry about the vocabulary problem: make sure you know what RDF you generate from your data, maybe use some simple ontologies (or none at all), and then just use SPARQL on the raw RDF data.
i am wondering if and how much SPARQL, as the core part of such a model of BI, would need to adapt. maybe add simple datatype support to SPARQL and a simple datatype vocabulary for RDF (such as XSD), so that storage and queries could be better optimized for typed data? is there anything else? and how does all of that sound as a killer app? i think this is much more realistic (both in the sense of "this actually works" and "this might actually succeed") than expecting to be able to semantically query the web.
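as a small illustration of what simple datatype support would buy (again a sketch using rdflib, with made-up data): with XSD-typed literals, a FILTER compares amounts as numbers and shipping dates as points in time rather than as strings, which is also roughly what a store would need in order to build typed indexes:

```python
# sketch of what typed literals buy: with xsd:decimal and xsd:dateTime, FILTER
# compares numbers and timestamps rather than strings; data is made up as above.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/bi/")
g = Graph()
g.add((EX["o-1001"], EX.amount, Literal("250.00", datatype=XSD.decimal)))
g.add((EX["o-1001"], EX.shipped, Literal("2011-03-28T12:00:00", datatype=XSD.dateTime)))
g.add((EX["o-1002"], EX.amount, Literal("990.50", datatype=XSD.decimal)))
g.add((EX["o-1002"], EX.shipped, Literal("2011-04-02T09:30:00", datatype=XSD.dateTime)))

# typed comparisons: amount as a number, shipping date as a point in time
for order, amount in g.query("""
    PREFIX ex:  <http://example.org/bi/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?order ?amount
    WHERE {
        ?order ex:amount ?amount ; ex:shipped ?when .
        FILTER (?amount > 500 && ?when >= "2011-04-01T00:00:00"^^xsd:dateTime)
    }
"""):
    print(order, amount)
```

without the datatype annotations, the same FILTER would simply drop every row, because SPARQL cannot meaningfully compare plain literals to numbers or dateTimes.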