Google’s flu fail shows the problem with big data

Published October 25, 2013 | arvindl

When people talk about ‘big data’, there is an oft-quoted example: a proposed public health tool called Google Flu Trends. It has become something of a pin-up for the big data movement, but it might not be as effective as many claim.

The idea behind big data is that large amounts of information can help us do things which smaller volumes cannot. Google first outlined the Flu Trends approach in a 2008 paper in the journal Nature. Rather than relying on the disease surveillance used by the US Centers for Disease Control and Prevention (CDC) – such as visits to doctors and lab tests – the authors suggested it would be possible to predict epidemics through Google searches. When suffering from flu, many Americans will search for information related to their condition.

The Google team collected more than 50 million potential search terms – all sorts of phrases, not just the word “flu” – and compared the frequency with which people searched for these words with the number of reported influenza-like illness cases between 2003 and 2006. This data revealed that out of the millions of phrases, there were 45 that provided the best fit to the observed data. The team then tested their model against disease reports from the subsequent 2007 epidemic. The predictions appeared to be pretty close to real-life disease levels. Because Flu Trends would be able to predict an increase in cases before the CDC, it was trumpeted as the arrival of the big data age.
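The Nature paper's exact pipeline isn't reproduced here, but the core idea – score a huge pool of candidate query terms by how well their weekly frequency tracks reported case counts, keep the 45 best-fitting terms, and fit a simple model on those – can be sketched with synthetic data. Everything below (the made-up data, the Pearson-correlation scoring, and the least-squares fit) is a hypothetical illustration of that kind of approach, not Google's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-in data (illustrative only) ---
# Weekly CDC-style counts of influenza-like illness over roughly three seasons.
weeks = 156
season = np.sin(np.linspace(0, 3 * 2 * np.pi, weeks)).clip(min=0)
ili_cases = 1000 * season + rng.normal(0, 50, weeks)

# Weekly search frequencies for many candidate terms; a handful genuinely
# track flu activity, the rest are noise.
n_terms = 5000
n_informative = 60
queries = rng.normal(0, 1, (weeks, n_terms))
queries[:, :n_informative] += season[:, None] * rng.uniform(0.5, 2.0, n_informative)

# Step 1: score every candidate term by its correlation with reported cases.
def pearson_scores(X, y):
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )

scores = pearson_scores(queries, ili_cases)

# Step 2: keep the 45 best-fitting terms.
top_terms = np.argsort(np.abs(scores))[::-1][:45]

# Step 3: fit a simple linear model (least squares) on those terms.
X_train = np.column_stack([queries[:, top_terms], np.ones(weeks)])
coef, *_ = np.linalg.lstsq(X_train, ili_cases, rcond=None)

# Step 4: "nowcast" a later season from new query data alone.
new_weeks = 52
new_season = np.sin(np.linspace(0, 2 * np.pi, new_weeks)).clip(min=0)
new_queries = rng.normal(0, 1, (new_weeks, n_terms))
new_queries[:, :n_informative] += new_season[:, None] * rng.uniform(0.5, 2.0, n_informative)

X_new = np.column_stack([new_queries[:, top_terms], np.ones(new_weeks)])
predicted_cases = X_new @ coef
print("Predicted peak week:", int(predicted_cases.argmax()))
```

One caveat the sketch makes visible: when 45 terms are picked from tens of millions purely because they fit a few seasons of historical data, some of that fit can be coincidence, and the model has no guarantee of holding up when the pattern of flu – or of searching – changes.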

Between 2003 and 2008, flu epidemics in the US had been strongly seasonal, appearing each winter. However, in 2009, the first cases (as reported by the CDC) started around Easter. Flu Trends had already made its predictions when the CDC data was published, but it turned out that the Google model didn’t match reality. It had substantially underestimated the size of the initial outbreak.
