or the alternative title for this could have been "The Killer App for Linked Data". more about the killer app in a minute; first about the not-killer-app: the hope that linked data will converge on a small set of vocabularies, and then become interlinked in the elusive Giant Global Graph.
at the Linked Data on the Web (LDOW 2011) workshop at WWW2011 this march, one of the most interesting discussions was about why some core vocabularies have not seen wider adoption. some people said the problem was simply that the right platform for sharing vocabularies had not yet been found.
however, in their excellent SIGMOD Record article "From Semantic Integration to Semantics Management: Case Studies and a Way Forward", arnon rosenthal, len seligman and scott renner make the following empirical observation from their experiences in a large and decentralized environment:
Semantic agreement comes at a cost, and that cost is driven both by the number of people who require a shared understanding, and the number of concepts they must all understand. The cost element appears to be the person-concept. We are aware of many examples of small numbers of participants agreeing on large, complex standards (e.g., meteorology) and of larger numbers of participants agreeing on modest standards (e.g., cursor on target), but we have seen few successes where large numbers of autonomous participants agreed to a large, complex standard. (page 47)
this is just empirical, but it's definitely food for thought: how realistic is it to hope that this will happen, even given a better platform for encouraging vocabulary reuse than currently exists? is the problem a mere get-vocabularies-to-the-people problem, or one of people-usually-do-not-agree-on-vocabularies? the case in point (as usual) is HTML, which is crappy as a publishing vocabulary, but probably succeeded because it is easy to understand, easy to use, takes a lot of bending without breaking, and delivers excellent value through the network effect.
now back to the title: from AI to BI? in the discussion following the problem of vocabulary adoption and thus data fragmentation, richard cyganiak, a very well-known linked data researcher from DERI, made a remark that surprised and delighted me. he said that most linked data value is usually derived by dumping all data into a single triple store, and then working on that. this has been practiced for years now, with the triple-count boasting of various linked data groups being a very good indicator, but it seemed to me that in most cases this was supposed to be a temporary fix until full distribution, decentralization, and interconnectedness could be achieved. it was great to hear richard say that centralized data crunching is basically the main activity when it comes to using linked data.
what used to be called data mining now seems to be morphing into something a bit wider labeled data science; and what used to be data warehousing is now the exact same thing but called business intelligence (BI) (listen to hal varian explaining how essential all of this already is and will increasingly be). what i am suggesting is that BI is the killer application for linked data and RDF. for this to happen, RDF needs to shed all the semantics/reasoning bulk, something that could be done with the ongoing work around a revised RDF, but, as richard remarks, does not seem to be happening (yet). this activity may be an important inflection point in RDF's history: become more powerful and sophisticated and even less accessible to the mainstream and the needs of average data management, or radically simplify and become the foundation for something that's accessible and usable for many people and applications.
imagine average BI providers understanding and applying linked data principles: it would work pretty much the same way the popular linked data cloud is being produced: scrape existing sources based on often not-so-pretty back-end processes, sprinkle the individual parts with a bit of provenance info, dump everything into a triple store, and start SPARQLing away. refine that to be real-time and push-enabled all the way through, and you have the holy grail of BI: real-time BI!
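here's a minimal sketch of that pipeline (nothing authoritative, just rdflib with made-up source URLs): pull RDF from a couple of silos, keep each source in its own named graph as cheap provenance, and SPARQL across the combined store.

```python
# a minimal sketch (not production code): pull RDF from two hypothetical
# sources, record which named graph each triple came from, and SPARQL
# across the combined store. all URLs and the vocabulary are made up.
from rdflib import ConjunctiveGraph, URIRef

store = ConjunctiveGraph()

sources = [
    "http://sales.example.com/export.rdf",   # hypothetical back-end export
    "http://crm.example.com/customers.rdf",  # hypothetical second silo
]

for url in sources:
    # each source goes into its own named graph, so the graph URI
    # doubles as cheap provenance info for every triple it contains
    store.get_context(URIRef(url)).parse(url)

# now just SPARQL the raw RDF; GRAPH ?src tells us where a result came from
results = store.query("""
    SELECT ?customer ?order ?src WHERE {
        GRAPH ?src { ?order <http://example.com/vocab/placedBy> ?customer }
    }
""")
for customer, order, src in results:
    print(customer, order, src)
```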
the main beauty of this idea is twofold: first, and maybe most importantly, there is a ton of interest and money in BI. that matters a lot. second: since BI is not globally distributed but centralized, there is no need to worry about the vocabulary problem. make sure you know what RDF you generate from your data, maybe use some simple ontologies (or none at all), and then just use SPARQL on the raw RDF data.
i am wondering if and how much SPARQL as the core part of such a model of BI would need to adapt. maybe add simple datatype support to SPARQL and a simple datatype vocabulary for RDF (such as XSD), so that storage and queries could be better optimized for typed queries? is there anything else? and how does all of that sound as a killer app? i think this is much more realistic (both in the sense of "this actually works" and "this might actually succeed") than expecting to be able to semantically query the web.
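to make the datatype question a bit more concrete, here is a small sketch (rdflib, with a hypothetical vocabulary and made-up data) of the kind of typed query i mean:

```python
# small sketch of a typed query: XSD-typed literals let the store
# compare values as numbers/dates rather than as strings, which is
# exactly what a BI-oriented store would want to index and optimize.
# vocabulary and data are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.com/vocab/")
g = Graph()
g.add((URIRef("http://example.com/order/1"), EX.amount,
       Literal("1500.00", datatype=XSD.decimal)))
g.add((URIRef("http://example.com/order/1"), EX.placedOn,
       Literal("2011-05-04T20:25:00", datatype=XSD.dateTime)))

# the FILTER operates on typed values, not lexical forms
big_recent = g.query("""
    PREFIX ex: <http://example.com/vocab/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?order WHERE {
        ?order ex:amount ?amt ; ex:placedOn ?when .
        FILTER (?amt > 1000.0 && ?when > "2011-01-01T00:00:00"^^xsd:dateTime)
    }
""")
for (order,) in big_recent:
    print(order)
```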
This sounds right to me. Every time I have done something non-trivial with Linked Data I have ended up working with local copies. Lately I've been wondering whether I even need to bother with RDF and SPARQL. Often all I really want to do is build graphs and query and analyze them in various ways. I can do this in a triplestore, but then I have to convert to RDF if I have some non-RDF data. Often data is in the form of a graph but isn't RDF. So why not just stick RDF and non-RDF-yet-graph-structured data alike in a graph store? And then I can write code to do the queries / analysis I want and don't have to worry about SPARQL.
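Here's a rough sketch of the kind of thing I mean (made-up data; networkx standing in for the graph store):

```python
# rough sketch: RDF triples and plain graph-shaped data side by side
# in one networkx graph, queried with ordinary code instead of SPARQL.
# all identifiers and the source URL are made up.
import networkx as nx
from rdflib import Graph as RDFGraph

g = nx.MultiDiGraph()

# RDF data: every triple simply becomes a labeled edge
rdf = RDFGraph().parse("http://example.org/people.rdf")  # hypothetical
for s, p, o in rdf:
    g.add_edge(str(s), str(o), label=str(p))

# non-RDF but graph-structured data goes straight in, no conversion
g.add_edge("alice", "bob", label="follows")
g.add_edge("bob", "carol", label="follows")

# "queries" are just code
followers_of_carol = [u for u, v, d in g.edges(data=True)
                      if v == "carol" and d["label"] == "follows"]
print(followers_of_carol)
print(nx.degree_centrality(g))  # analysis works the same way
```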
Posted by: Ryan Shaw | Wednesday, May 04, 2011 at 20:25
Three short comments:
- this blog is the first time I have read about business intelligence & analytics properly linked with the linked data environment. The key to me is that you can apply (quite easily, or "not too far away" in terms of timing) traditional BI techniques to linked data contents. The much more exciting stuff is to apply broader analytics, which would mean "generate new information based on linked data contents".
- Maybe a key to opening up linked data contents to traditional BI systems would be to create a LinkedData-OLAP adapter (an idea I had when looking at http://www.simba.com/olap-sdk-features.htm ; a rough sketch follows at the end of this comment): this would instantly enable much higher adoption rates, as business users would access linked data contents without even leaving their desktop & application environment (and without knowing that they used 'rdf-coded' information ;-)
- You cited @cygri's statement re. local copies of triples: that's today's state of technology. I still hope that we get to something like "federated sparqling", where the querying person does not have to know where statements about a resource are made. Today the infrastructure-wise effort to recreate the most important triple sources for your own use is too high, and the publicly & freely available infrastructure does not - today - match up to the requirements you have in production environments.
A wonderful blog post, indeed!
(re. Ryan: basically right, but I believe in the technical advantages & beauty of SPARQL. Still, I do not believe that e.g. a large share of business analysts will ever learn SPARQL, therefore we should make connecting with linked data sources easy in today's (web) applications, and we'll likely have to hide SPARQL queries behind nice UIs to get acceptance.)
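Re. the OLAP adapter above, a very rough sketch of what I mean (made-up data and endpoint): query linked data contents with SPARQL, then hand the results to an ordinary pivot/cube layer, which is the kind of view an adapter would expose to BI tools.

```python
# very rough sketch (hypothetical endpoint, vocabulary and data):
# SPARQL results from a linked data source fed into an ordinary
# pivot table, i.e. the kind of cube view an OLAP adapter would expose.
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://stats.example.org/sparql")  # hypothetical
sparql.setQuery("""
    SELECT ?region ?year ?sales WHERE {
        ?obs <http://example.org/vocab/region> ?region ;
             <http://example.org/vocab/year>   ?year ;
             <http://example.org/vocab/sales>  ?sales .
    }
""")
sparql.setReturnFormat(JSON)
rows = [{k: v["value"] for k, v in b.items()}
        for b in sparql.query().convert()["results"]["bindings"]]

df = pd.DataFrame(rows)
df["sales"] = df["sales"].astype(float)
# a classic OLAP-style view: regions x years, aggregated sales
print(df.pivot_table(values="sales", index="region",
                     columns="year", aggfunc="sum"))
```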
Posted by: Dakoller | Thursday, May 05, 2011 at 02:08
Business Intelligence and Linked Data are connected subjects in our world view re. the Giant Global Graph of Linked Data. Intelligence is a function of being able to access and make sense of data across disparate data sources. This applies to individuals and their social networks just as it applies to the same individuals within enterprise intranets. Of course, it also applies to enterprises (organizations, which are Agents too).
No matter what moniker we apply to the subject matter in question, the fundamental value boils down to increased Agility by surmounting the inertia of data silos. This is indeed the essence of the matter re. Linked Data, since it facilitates data virtualization across heterogeneous data sources via a mesh of distributed data objects.
Key things to note:
1. none of what I state is unique to RDF. RDF is simply an option. The power comes from URIs and an EAV-based linked data graph representation (a tiny sketch follows this list)
2. reasoning is important and mandatory, but you need the linked data substrate in place first for the "sense making" prowess of reasoning to surface.
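A tiny illustration of point 1 above (made-up record): the same data as a table row and as EAV-style statements keyed by a URI. Nothing here is RDF-specific; RDF is just one serialization of this shape.

```python
# made-up example: one table row restated as entity-attribute-value
# statements, with a URI as the entity identifier
row = {"id": "emp42", "name": "Alice", "dept": "Sales"}

entity = "http://example.com/employee/emp42"   # hypothetical URI
eav = [(entity, "http://example.com/vocab/" + attr, value)
       for attr, value in row.items() if attr != "id"]

for e, a, v in eav:
    print(e, a, v)
```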
Some links to posts and resources that cover Linked Data and Business Intelligence (BI) from the past:
1. http://www.openlinksw.com/weblog/public/search.vspx?blogid=oerling-blog-0&q=business%20intelligence&type=text&output=html -- collection of posts from Orri Erling's blog data space
2. http://www.delicious.com/kidehen/virtuoso_sparql_tutorial -- collection of tutorials oriented towards making sense of data
3. http://ods.openlinksw.com/wiki/main/Main/VOSArticleBISPARQL2 -- business intelligence extensions for SPARQL (SPARQL-BI).
Kingsley
Posted by: Kidehen | Thursday, May 05, 2011 at 05:44
@ryan, if all you want is graph storage and processing, RDF and SPARQL indeed may be a bit of an inconvenience. but if your to-be-graphified data has nodes that are identified by URIs, and might reuse certain relationships that can also be identified by a common set of identifiers (such as URIs), and if you want some built-in mechanism to separate namespaces and subgraphs in the graph, then RDF and SPARQL already may look much more appealing.
so i think for general graph processing you're right that RDF/SPARQL might not be such an excellent model and tool set, but for datasets that are derived from (more or less well-designed) web-centric datasets and services, i think it does have quite a bit of appeal. it still is painful to use for data that has more of an ordered tree angle to it or a regularly structured table angle, and i think these are the areas where both RDF and SPARQL might need a bit of evolution (or layered technologies) to make life a bit easier.
Posted by: dret | Thursday, May 05, 2011 at 17:40
@dakoller: federated SPARQLing is an active research area, but because it is very hard to get right and efficient (as decades of research in federated databases have demonstrated: that work basically never produced anything functioning, and in practice people turned to data warehousing as an ugly but practical approach), i wouldn't wait for it. you might get it working when expanding the "let's SPARQL a single RDF silo" approach to a "let's SPARQL a couple of RDF silos" approach, but if it should really work at web scale, you essentially need to build real-time crawling into the infrastructure, and that is, to put it mildly, maybe not all that easy to get working.
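to make the "couple of RDF silos" variant concrete, here's a sketch (hypothetical endpoints; the SPARQL 1.1 SERVICE keyword it relies on is still a working draft, and the queried endpoint must support it):

```python
# a sketch of the "SPARQL a couple of RDF silos" middle ground:
# the SERVICE keyword (SPARQL 1.1 federation, a working draft at the
# time of writing) pulls part of the pattern from a second endpoint.
# endpoint URLs and the vocabulary are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://silo-a.example.com/sparql")
sparql.setQuery("""
    SELECT ?product ?price ?rating WHERE {
        ?product <http://example.com/vocab/price> ?price .
        SERVICE <http://silo-b.example.com/sparql> {
            ?product <http://example.com/vocab/rating> ?rating .
        }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["product"]["value"], row["price"]["value"],
          row["rating"]["value"])
```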
but the point is that, as with data warehousing, SPARQLing silos already delivers a lot of value, and as long as the data has some backlinks to its origin (conveniently provided by RDF), i am sure that even in this slightly less elegant setting there is a lot that could be done. i really like the RDF/OLAP idea; maybe these are two communities that really should talk to each other a whole lot more.
Posted by: dret | Thursday, May 05, 2011 at 17:48
re. Linked Data converging on a small set of vocabularies - although reuse of terms is desirable where appropriate (e.g. dc:title instead of my:title), to work in a global environment covering millions of different subject domains of varying levels of specialization, a large set of vocabularies is essential. I reckon in this context there is a definite role for some level of inference, like subclass/subproperty reasoning to tie those vocabularies together down through specializations. Otherwise we either compromise the potential utility of the data (by sticking with over-generalizations) or risk continuing the disconnect that the current one-API-per-site setup incurs.
re. local stores, federation etc - although a lot of the infrastructure kind of specifications are in place, we're still a way off being able to *fluidly* obtain, manipulate and republish data in practice. Even if everyone decided to publish RDF overnight (as distributed LD or in SPARQL stores), there'll still be a gaping void out there which needs filling with intermediary services and end-user tools, to make the stuff really useful. This stuff takes time :)
Posted by: Danny | Friday, May 06, 2011 at 06:06