The Revolution Will Be Data-ized
By Kenneth Cukier
Introductory remarks to "Real World Applications" panel
O'Reilly's Strata conference, February 2, 2011, Santa Clara, California
* * *
Good afternoon. Welcome to the "real world" -- the "real world applications" session. What goes on outside this room is about technology, tools and platforms. They are still in a state of being defined and created. Here, it is about how these tools are put to use by businesses to improve what they do.
Just as Washington, DC is defined as "a city bounded on all sides by reality," so too is Silicon Valley. It is not where technology is put into practice -- it is where technologies are born; where they come from. It is the vendors. The place where the technologies are most richly exploited is everywhere else. So how businesses work with big data to solve business needs is the topic of the session.
Two days ago at the "data camp" on the eve of Strata, Ken Krugler of Bixo Labs described what he regarded as the world's first big data problem: the US census of 1880. It took seven years to calculate, so long that the census bureau needed to find a new way to crunch the data -- which ushered in the Hollerith Machine a decade later (the precursor of IBM). But I would suggest that the issue of Big Data dates back much farther still. Certainly it should date back to the 3rd century BC when Assyrian clay writing tablets became so numerous that early librarians would affix a small clay label to the basket in which they were stored, noting the author and the work -- the invention of metadata.
Later, as Gutenberg's printing press was established around 1450, printing didn't immediately take off. Two other inventions were needed: pagination and indices -- to help organize the explosion of textual content and make it useable. In other words, an early version of MapReduce.
Which is all to say, these problems have been with us for a long time.
In this respect, it makes little sense to think of "big data" -- and what it means, and its potential -- strictly in terms of its size. After all, size is relative and all periods have had to suffer under information overload. Still, there is clearly something new happening here. So what can we say about it, to get at the heart of what's going on? Here is how I think about the idea of big data:
First, it's big. We can define it by its huge volume, even if this shouldn't be the main criterion. Second, it's fast -- fast in processing power and fast in terms of velocity or its real-time nature. Third, it's smart -- we can do clever things with it. Fourth, it's messy -- lots of the data are unstructured. Fifth, it's revealing.
These aspects of big data seem to separate what is novel from what came before. Consider each:
* Big. Google's translation system works because it treats translation not as an artificial intelligence problem of teaching a computer the rules of grammar and syntax, but as a mathematical problem, trying to score the probability that a word in one language is a suitable substitute for a word in another. It works better than all previous systems not because Google's algorithms are smarter, but because Google has more data. The scale of the data enabled Google to do something that couldn't happen before.
* Fast. On one hand, high-frequency trading and real-time analytics are examples. But the decrease in processing time is breathtaking. A year and a half ago, Visa was able to reduce the processing time on two years of transactions from one month to 13 seconds using Hadoop. Now, we don't know what valuable new insights came from this. But we have to assume that being able to crunch data so quickly will yield something interesting and useful.
* Smart. The business intelligence systems are getting smarter. Most mobile operators are able to control churn by knowing the biggest predictor for why subscribers switch, and how to prevent them from doing so. Doctors are able to spot the onset of infections in premature babies before the symptoms exist, by running the vital signs through their algorithms -- and by spotting it early, they can take more effective action, which saves lives.
* Messy. Twitter messages seem a big jumble. But one hedge fund is doing sentiment analysis to trade on the information in aggregate.
* Revealing. When the nearby city of Oakland publicly released police data, Stamen Design made a website, Oakland Crimespotting. A few clicks of a mouse reveal that the police sweep a major boulevard for prostitution starting at one end and moving to the other, and never make arrests on a Wednesday. When the city agreed to make the data open, no one ever imagined it would reveal confidential police tactics.
All of these examples point to one thing: big data tells us things, if we are clever enough to know how to listen. In fact, what makes big data so special is that it lets us spot things that otherwise cannot be seen by the naked eye alone -- such as the onset of infections in premature babies and the like.
In other words: real world things. From these examples, a working definition of big data can be attempted. Here's one that I'll try:
"Things you can do at a big scale that you fundamentally cannot do at a small one, to extract new insights or create new forms of economic value, in ways that change markets, organizations, the relationship between citizens and governments, and more."
With this as a base, we have an extraordinary panel of speakers to describe what they are doing with data. Let me introduce them.
# # #