
Big data and big statistical mistakes

Tim Harford has an excellent critique of the statistical issues with the “big data” trend in Big data: are we making a big mistake? First, there’s this:

But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast.

I still love the term “digital exhaust”. I first saw Frank Chimero use it in the context of social media when he said (in a post that’s now gone from the internet):

The less engaged I become with social media, the more it begins to feel like huffing the exhaust of other people’s digital lives.

But back to big data. The big problem (see what I did there?) is that statistical problems don’t just go away when you have more data. In fact, they get worse. For example:

Because found data sets are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.
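That point is easy to demonstrate with a quick simulation. The scenario below is entirely hypothetical: a population split into two groups with different true opinion rates, where the "found" data over-represents one group (say, because it comes from a platform that group favors). A small random sample lands near the truth; a million-record biased sample stays confidently wrong.

```python
import random

random.seed(42)

# Hypothetical population: group A (30% of people) holds an opinion
# 90% of the time; group B (70% of people) holds it 40% of the time.
TRUE_RATE = 0.30 * 0.90 + 0.70 * 0.40  # = 0.55

def person(group_a_share):
    """Draw one person; return 1 if they hold the opinion."""
    if random.random() < group_a_share:
        return 1 if random.random() < 0.90 else 0  # group A
    return 1 if random.random() < 0.40 else 0      # group B

# A small but properly random sample: group A at its true 30% share.
random_sample = [person(0.30) for _ in range(1_000)]

# A huge "found" data set: group A makes up 80% of the records
# because they generate most of the digital exhaust.
found_data = [person(0.80) for _ in range(1_000_000)]

est_random = sum(random_sample) / len(random_sample)
est_found = sum(found_data) / len(found_data)

print(f"true rate:        {TRUE_RATE:.2f}")
print(f"1,000 random:     {est_random:.2f}")  # lands near 0.55
print(f"1,000,000 found:  {est_found:.2f}")   # stuck near 0.80
```

The extra three orders of magnitude of data buy you nothing here: the found-data estimate converges tightly, but to the wrong number. More data shrinks the error bars around the bias, not the bias itself.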

The article goes into more detail on this, and I think it's important for us to recognize the limitations of big data before jumping on the bandwagon.