The hype, benefits, and dangers of Big Data

A Readlist of all the articles referenced in this post is available here. Readlists allow you to send all the articles to your Kindle, read them on your iOS device, or download them as an e-book.

Despite the overly alarmist title, Andrew Leonard’s How Netflix is turning viewers into puppets [1] is a fascinating article on how Netflix uses Big Data in its programming decisions:

“House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. For almost a year, Netflix executives have told us that their detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.

The article also asks what this approach means for the creative process, something I’ve written about before in The unnecessary fear of digital perfection, so I won’t rehash that argument here.

What interests me about the rise of Big Data approaches to decision-making is the high level of inaccuracy inherent in the analysis process. This is something we don’t hear about often, but Nassim N. Taleb recently wrote a great opinion piece about it for Wired called Beware the Big Errors of ‘Big Data’, in which he states:

Big-data researchers have the option to stop doing their research once they have the right result. In options language: The researcher gets the “upside” and truth gets the “downside.” It makes him antifragile, that is, capable of benefiting from complexity and uncertainty — and at the expense of others.

But beyond that, big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal). It’s a property of sampling: In real life there is no cherry-picking, but on the researcher’s computer, there is. Large deviations are likely to be bogus.

He gets into more detail on the statistical problems with Big Data in the article, and his book Antifragile looks really interesting too.
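Taleb's claim that "the spurious rises to the surface" can be illustrated with a small simulation. What follows is a minimal sketch, not taken from his article: it fills a data set with independent random noise, so no real relationship exists between any two variables, and then reports the strongest correlation it can find between any pair. The function names (`pearson`, `strongest_spurious_correlation`) and the choice of 50 observations are assumptions made purely for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_correlation(num_variables, num_observations, seed=0):
    """Largest |correlation| found among independent pure-noise variables."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: there is no signal to find.
    columns = [
        [rng.gauss(0, 1) for _ in range(num_observations)]
        for _ in range(num_variables)
    ]
    # "Cherry-pick" the most impressive-looking pair out of all possible pairs.
    return max(abs(pearson(x, y)) for x, y in combinations(columns, 2))


if __name__ == "__main__":
    for num_variables in (5, 20, 80):
        strongest = strongest_spurious_correlation(num_variables, num_observations=50)
        pairs = num_variables * (num_variables - 1) // 2
        print(f"{num_variables:3d} variables, {pairs:4d} pairs: max |r| = {strongest:.2f}")
```

Because every column is noise, any correlation the script reports is spurious by construction. With a fixed seed, the larger runs contain all the pairs of the smaller ones, so the strongest correlation can only stay the same or grow as variables are added. That is the cherry-picking effect in miniature: the more pairs you can search, the more convincing the best bogus relationship looks.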

Since I haven’t written about Big Data before, I also want to reference a few articles on the topic that I enjoyed. Sean Madden gives some interesting real-world examples in How Companies Like Amazon Use Big Data To Make You Love Them [2]. On the skeptical side, Stephen Few argues in Big Data, Big Deal that “interest in big data today is a direct result of vendor marketing; it didn’t emerge naturally from the needs of users.” He also makes the point that data has always been big, and that by focusing on the “bigness” of it, we’re missing the point:

A little more and a little faster have always been on our wish list. While information technology has struggled to catch up, mostly by pumping itself up with steroids, it has lost sight of the objective: to better understand the world—at least one’s little part of it (e.g., one’s business)—so we can make it better. Our current fascination with big data has us looking for better steroids to increase our brawn rather than better skills to develop our brains. In the world of analytics, brawn will only get us so far; it is better thinking that will open the door to greater insight.

Alan Mitchell makes a similar point in Big Data, Big Dead End, a case for what he calls Small Data:

But if we look at the really big value gap faced by society nowadays, it’s not the ability to crunch together vast amounts of data, but quite the opposite. It’s the challenge of information logistics: of how to get exactly the right information to, and from, the right people in the right formats at the right time. This is about Very Small Data: discarding or leaving aside the 99.99% of information I don’t need right now so that I can use the 0.01% of information that I do need as quickly and efficiently as possible.

What I think we should take from all of this is that our ability to collect vast amounts of data comes with enormous predictive and analytical upside. But we’d be foolish to think that it makes decision-making easier, because Big Data does not take away the biggest challenge of data analysis: figuring out how to turn data into information, and information into knowledge. In fact, Big Data makes this harder. To quote Taleb again:

I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack.

In other words: proceed with caution.

[1] Link via @mobivangelist

[2] It’s interesting that the phrasing of both this headline and the Netflix one implies that companies are using Big Data to persuade us to do things against our will. But I can’t figure out if that’s a real fear, or just clever linkbait.