Big data and predictive analytics: When is enough data enough?
“Many predictive analytics applications turn out not to need all of the data,” Berry said during his keynote talk at Predictive Analytics World. So the real task for data scientists and their colleagues isn’t figuring out how to analyze all the available data; it’s figuring out how much data it takes to see something worth noting. The bad news?
“There’s not a simple answer to that question,” Berry said.
However, testing the predictive model’s performance by incrementally adding more data can shed light on when enough is enough. For example, when Berry wanted to know the typical bid by travel agency partners for a specific hotel and a specific customer, he computed a running average: the first two bids, then the first three, then the first four and so on, until the average hit a steady plateau at about 100,000 bids. If he kept going to 200,000 bids, the average would still change, sure, but not enough to matter.
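A minimal sketch of that incremental check, in Python. This is not Berry’s actual code; the function name, tolerance and window size are illustrative assumptions. The idea is simply to keep extending the sample and stop once the running mean stops moving:

```python
import random


def running_average_plateau(bids, rel_tolerance=0.001, window=1000):
    """Return (sample size, mean) at the point where the running mean stabilizes.

    Assumes `bids` is a non-empty list of numeric amounts. A plateau is
    declared when the mean moves less than `rel_tolerance` (relative) over
    `window` additional bids; both thresholds are illustrative choices.
    """
    total = 0.0
    prev_mean = None
    for n, bid in enumerate(bids, start=1):
        total += bid
        mean = total / n
        if n % window == 0:
            if prev_mean is not None and abs(mean - prev_mean) <= rel_tolerance * abs(prev_mean):
                return n, mean  # adding more data no longer moves the estimate
            prev_mean = mean
    return len(bids), total / len(bids)  # no plateau found within the data


# Usage with simulated bids: the mean typically settles long before the data runs out.
bids = [random.gauss(100, 25) for _ in range(200_000)]
n, mean = running_average_plateau(bids)
print(f"Mean settled at {mean:.2f} after {n:,} bids")
```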
“That’s the way data tends to be: When you have enough of it, having more doesn’t really make much difference,” he said.
So if more data doesn’t matter, what does? “So many things,” Berry said: working with clean data, sampling without bias, hiring staff dedicated to data quality, and thinking creatively.
That’s right: There’s a big place in predictive analyses for those soft data science skills, such as figuring out which variables can make the model stronger or what new patterns might be discovered by combining different kinds of data. Examples?
“Someone had to think of the idea of wind chill factor,” Berry said. That someone combined air temperature and wind speed to reveal a new data point: what the weather will actually feel like.
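As a concrete illustration of that kind of derived variable, here is the U.S. National Weather Service wind chill formula (the 2001 index). Treating it as a feature-engineering function is my framing, not Berry’s:

```python
def wind_chill_f(temp_f, wind_mph):
    """NWS (2001) wind chill index: one derived feature built from two raw ones.

    Valid roughly for temperatures at or below 50 F and wind above 3 mph.
    """
    return (35.74 + 0.6215 * temp_f
            - 35.75 * wind_mph ** 0.16
            + 0.4275 * temp_f * wind_mph ** 0.16)


# Example: 20 F with a 15 mph wind feels like about 6 F.
print(round(wind_chill_f(20, 15)))
```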
More big data delusions
Berry wasn’t the only presenter who badmouthed the state of big data and predictive analytics. Karl Rexer, founder of the consulting firm Rexer Analytics, went so far as to suggest that the current crop of data scientists suffers from a bit of delusional thinking.
Respondents to his 2013 Data Miner Survey indicated that data sets are getting bigger. But when Rexer asked them how many records are in a typical data set they use for analysis, “We get the same answer we got in 2007,” he said.
That’s not to say big data is a farce or to give short shrift to the interesting work some are doing in this space, he said. “But for the typical analytic predictive modeling/data mining/whatever-you-want-to-call-it project, I would say the overall sample size used for those data mining projects is not increasing.”