UC Berkeley School of Information, May 8-9, 2014
Workshop: Introduction to Data Science with Hadoop
Speaker: Michael Ernest (Cloudera)
Michael defines "Data Science" rather narrowly as "the practice of evincing value from available data", which probably makes sense for a workshop focused on Hadoop. As a consequence, the tutorial is missing the angle of distribution and decentralization: it assumes that all data sits in one centralized location and can then be processed. So it really is more about Big Data than Data Science.
Starting with the question of "What is Hadoop?": at the core, it is a combination of HDFS and MapReduce, augmented by an ecosystem of tools and complementary technology.
MapReduce as an analytics framework is not good for applications that depend on shared or global state.
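To make the point about shared state concrete, here is a minimal in-memory sketch of the MapReduce pattern in plain Python. It is not Hadoop code and was not shown in the workshop; the names `map_words`, `reduce_counts`, and `run_mapreduce` are made up for illustration. The mapper sees one record at a time and the reducer sees only the values for one key, so neither can rely on state shared across records or keys.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine all values emitted for a single key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["big data big value", "data science"]
result = run_mapreduce(lines, map_words, reduce_counts)
print(result)
```

Anything that needs a running global total or cross-key lookups while processing would have to be expressed as additional map and reduce passes, which is where the model gets awkward.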
As one of the developments in the Hadoop space, YARN (Yet Another Resource Negotiator) does not replace MapReduce's programming paradigm, but its job distribution framework (useful for large installations, in the range of multi-thousand node systems). It does not change how you write MapReduce code; it just changes how you run it.
Some parts of the Hadoop ecosystem are presented: Sqoop, Flume, Hive, Pig, HBase, Oozie. HBase ("I need high-speed, random CRUD on big data!") is a popular NoSQL tool that competes for resources with MapReduce, but it is more a data access tool than a data science tool. Oozie handles workflow orchestration, mostly for coordinating processing steps.
Using Hadoop as an ETL appliance is the bread-and-butter business for some Hadoop companies. The goals of Hadoop and ETL are the same: enabling analytics.
Summing up, this workshop was an interesting overview of Hadoop, but it would have been nice to position it a bit better in the bigger Data Science field.
Panel: UC Berkeley Data Science Startups
Panelists: Josh Bloom (wise.io), Kuang Chen (Captricity), Joe Hellerstein (Trifacta), Ion Stoica (UC Berkeley)
Anno Saxenian starts by stating that startup funding goes through boom and bust stages, and currently we seem to be in a boom phase.
Josh Bloom says that finding the right buyers is critical: nobody wants to buy tools; customers want to solve problems. Kuang Chen talks about improving data entry so that the data becomes available for analysis. Joe Hellerstein talks about finding the inefficiencies in how analysts work and ways to improve the process: data cleaning, data wrangling.
Anno asks the panelists: "What will data science look like in 5 years?" Josh claims that data scientists are unicorns: they don't really exist, but the discipline should be something where a person has deep skills in one field and can converse about the other ones. What is needed are people who function very well in teams. Kuang says there should be more focus on value per bit, and in many cases that is data/content that people bother to write down, but that then gets filed away. Josh adds that people-generated data may increasingly be people-originated data from wearables. Ion claims that there is still a huge gap between how much data is already stored and available, and how easy it is to extract value from it. This means that we will see many more tools for working with data, and keeping track of that landscape will become increasingly important. Joe claims that the development of data science will be similar to that of computer science: data science should become more accessible so that more people can use it, and making it more usable and accessible will be one of the big changes.
Summarizing, this was a really interesting panel, giving some insights into how challenges in data-intensive fields make room for a whole sector of new startups.
Panel: Venture Capital Roundtable
Panelists: Michael Borrus (XSeed Capital), Jake Flomenberg (Accel Partners), Adam Ghobarah (Google Ventures), Mamoon Hamid (Social+Capital Partnership)
As a starting point for what VCs are looking at these days, Jake Flomenberg points out many low-hanging fruits, such as government processes.
As a general guideline for how to be a good data-oriented startup, one simple way to go would be to publish more data, and make it available so that value can be extracted from it (rather than hiding it behind old-fashioned applications).
Adam Ghobarah points out how simplistic the current array of sensors and wearables is, and that there probably is going to be a lot of development in this space. The biggest problem when it comes to finding valuable startups is to find some that have identified a valuable problem to solve; just being able to find new insights is not enough if they do not extract enough value.
Jake points out that on the one hand there is too much noise in the space, but on the other hand there are also many unsolved problems. Adam Ghobarah points out that cross-cloud mobility is an area where we'll see both a need and products. Moving to the cloud just replaces one problem with another, and being able to abstract from specific clouds will become important.
In summary, an interesting view into what drives VC funding, what some players are looking at and how they see the current state, and what current hopefuls maybe should take into account when looking for focus and funding.
Keynote: Ethics and Data Science
Speaker: Jeff Hammerbacher (Cloudera)
I won't even start trying to summarize this talk. Funny, entertaining, thought-provoking, and very broad. If you ever have the chance of seeing Jeff speak somewhere, definitely go!