A Theory of Big Data

Last year, I spent some time with one of our largest customers talking about Big Data. Leaders there have deep ties to statistical analysis and research methods. They were, after all, part of the psychometric old guard that had invented and refined some of the most advanced aspects of high-stakes testing.


Big Data hype would lead you to believe it would cure cancer and create world peace. They were not falling for it.


At issue was the idea that your data could tell you something. ‘Data doesn’t tell you anything, you have to ask it questions, you have to have a theory you are testing.’ I don't disagree with this fundamental of science and statistics, and I certainly feel that the hype cycle on Big Data is nearing historic proportions.


One thing we agreed on was that data can often be explored to develop interesting questions to ask. The recent “Ballghazi” scandal involving under-inflated footballs provides an interesting example of how this might happen.


Football data analyst Warren Sharp delivered a thought-provoking analysis of the Patriots' performance on one aspect of the game that might be related to football inflation: fumble turnover prevention. In a nutshell, the Patriots are better at this game statistic than any other team in the NFL. But not just better as in top-of-the-bell-curve, better as in ‘off-the-bell-curve’, literally off-the-chart.
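To make "off the bell curve" a little more concrete, here is a minimal Python sketch of one common way to spot that kind of value: compute how many standard deviations each team sits from the league mean and flag anything beyond a cutoff. This is not Sharp's method. The team names, the plays-per-fumble numbers, and the cutoff of 3 are all made up for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return how many sample standard deviations each value sits from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def outliers(labels, values, threshold=3.0):
    """Return (label, z) pairs whose absolute z-score exceeds the threshold."""
    return [
        (label, z)
        for label, z in zip(labels, z_scores(values))
        if abs(z) > threshold
    ]


# Hypothetical plays-per-fumble figures, invented for this example.
# These are NOT real NFL statistics.
teams = [f"Team {i}" for i in range(1, 16)] + ["Team X"]
plays_per_fumble = [100, 102, 98, 105, 95, 101, 99, 103,
                    97, 104, 96, 100, 102, 98, 100, 180]

for label, z in outliers(teams, plays_per_fumble):
    print(f"{label} is {z:.2f} standard deviations from the mean")
```

A flag like this only says that a value is unusual. It says nothing about why, which is exactly where a theory has to come in.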


Take a few minutes to read it, and you’ll not only be treated to an interesting spelunking of NFL data, but you’ll also see how data can be explored to identify interesting topics to theorize about further. Develop a theory for WHY the Patriots are so good at this statistic, and you can go and test that. Against the data.

UPDATE: There is a detailed critique of the original data analysis here. It is a further reminder to interpret data carefully and not to overreach on theories.