Data Science is mentioned a lot, and many people say that they are doing it. but the term seems to take on surprisingly different meanings depending on who is using it. here is a modest attempt to identify two intertwined but still reasonably distinct sides of Data Science.
there is of course substance behind the term, even though it may have become a bit obscured by its recent inflationary use. as with pretty much every term that has become hyped, it makes sense to start with a basic definition before using it, so that at least within a given conversation it is reasonably clear what the term refers to.
when thinking about a good definition for Data Science, and how the term can be used in a somewhat well-defined way, it seems to make sense to separate two views of what Data Science encompasses. of course these two sides are related and connected, but they are fairly distinct, and it may make sense to clearly call out when only one of them is meant (assuming that when Data Science is used without further qualification, it refers to both of them).
here is an attempt to identify these two sides. maybe the labels could be a bit snappier, but i am mostly interested in getting the discussion started and in better understanding what others think Data Science is and isn't:
- Vertical Data Science (VDS): this side refers to methods for making sense of raw data, combining data, and making sense of the combined data (an iterative process that can be repeated any number of times). this side has a big overlap with Big Data, and in a corporate context it is often referred to as Business Intelligence (BI). the main aspect here is that ownership and management of the data rest with a single party, and the main task is to use these large amounts of data for sense-making.
taking the proverbial needle in the haystack, VDS is all about methods to find needles quickly in very large haystacks, and to be able to search for things that may not even be needles but fit some interesting pattern (finding something needly).
- Horizontal Data Science (HDS): given that VDS allows us to expose BI services from one data owner/manager/needle-searcher, the question then is how to design a Service-Oriented Architecture (SOA) around these services. REST is often used as one way to implement this vision. in HDS, the implementation of a single VDS provider is opaque; what matters is that VDS providers expose data-driven services that can be used to implement sense-making across an ecosystem of such providers.
taking the proverbial needle in the haystack, HDS is all about methods for searching across haystacks, and about providing an ecosystem of haystacks where people get access to haystacks but haystack owners can still control and limit that access, so that, for example, certain privacy aspects can be protected.
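as a rough illustration of the two sides, here is a minimal Python sketch. everything in it (the `Haystack` class, the `search_across` function, the user names, and the sample data) is hypothetical and invented for this example. it only tries to show the idea that each provider keeps its implementation and its access rules to itself, while the horizontal side works purely against the search service that providers expose.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Haystack:
    """hypothetical VDS provider: owns its data and decides who may search it."""
    name: str
    _items: list[str]
    allowed_users: set[str] = field(default_factory=set)

    def search(self, user: str, looks_needly: Callable[[str], bool]) -> list[str]:
        # the owner controls access; callers never see how the data is stored
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not search {self.name}")
        return [item for item in self._items if looks_needly(item)]


def search_across(
    haystacks: list[Haystack],
    user: str,
    looks_needly: Callable[[str], bool],
) -> dict[str, list[str]]:
    """hypothetical HDS layer: query every accessible haystack, keyed by name."""
    results: dict[str, list[str]] = {}
    for stack in haystacks:
        try:
            results[stack.name] = stack.search(user, looks_needly)
        except PermissionError:
            continue  # skip haystacks whose owner has not granted access
    return results


# assumed sample data, purely for illustration
lab = Haystack("lab", ["hay", "needle", "pine needle"], {"alice"})
clinic = Haystack("clinic", ["needle", "straw"], {"bob"})

hits = search_across([lab, clinic], "alice", lambda item: "needle" in item)
print(hits)  # {'lab': ['needle', 'pine needle']}
```

in a real ecosystem the `search` call would more likely be a remote service (for example a REST resource) than a method call, but the division of labor would be similar: the predicate and the storage live on the VDS side, while discovery, access, and combination of results live on the HDS side.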
while this separation of VDS and HDS may be rough and incomplete, it seems useful for explaining some of the large differences in opinion about what Data Science is and isn't. VDS is all about sense-making and processing, and is necessary to be able to deal with the deluge of data in many data-driven sciences. HDS is all about combining results and setting results in new contexts, based on the assumption that all participants in the larger science picture have the freedom to implement their VDS in whichever way they like.
i am certainly biased here, but it seems to me that more people are talking about VDS than about HDS. and that may be a reflection of the same picture that you find in data-driven corporate settings: while there are clearly identifiable stakeholders (such as product groups) that have to get the VDS side done, the incentives and the stakeholders to oversee HDS are less clearly defined, and this side of the picture thus sometimes gets lost.
i am wondering how many people would (a) agree that VDS and HDS indeed are two valid sides of the Data Science coin, and (b) see one of them being mentioned and/or pushed more prominently than the other. personally, i think that both sides are essential: science needs good methods to work in data-driven scenarios, but it also needs a solid foundation for robustly combining those scenarios, so that new studies and experiments can easily be based on previously disconnected datasets and results.