Monday, July 8, 2013

"Very relevant to us!" - How Google Flu Trends Is Getting to the Bottom of Messy Data

by Nicholas Diakopoulos  |  11:00 AM July 5, 2013

Churning through, tabulating, and modeling millions of search queries every day, Google Flu Trends can measure, a full two weeks before the CDC, the incidence of influenza-like illnesses (ILI) across the U.S. Any official response to a flu pandemic, such as vaccine distribution and timing, could be greatly enhanced with such an early warning. And while not billed as an ersatz measure, Google Flu has had an uncannily high correlation with the CDC's own slower, yet more assiduously produced estimate of ILIs.
Until this past flu season, that is. The algorithm drastically overestimated the actual flu rate, in some cases by a factor of almost two, according to a report in Nature News. It's still not 100% certain why it failed, especially since Google isn't speaking publicly about it just yet.
Big data systems like Google Flu are complex and unwieldy beasts. They can (and sometimes do) fail to give us the insights we think they should. They're temperamental, messy, and can break down when the data or model changes unpredictably. So as your business adapts to making more and more data-driven decisions, from managing supply chains to hiring the best employees, how can you be confident in your big data decision-making process?
I spoke to Rajan Patel, co-inventor of Google Flu, and he explained the two strategies in their assurance process: algorithms that detect and mitigate aberrations in search frequency that might throw their estimate off, and people to get to the root cause of system failures so that biases get rooted out of statistical models. The algorithms manage most of the day-to-day sanity checks before releasing estimates to the public, and the deeper systemic investigations by people are sparked by abnormalities like the H1N1 outbreak in 2009 and this past winter's flu season.
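To make the first strategy concrete, here is a minimal sketch of an automated sanity check on search frequency. It is not Google's method, which has not been published in this detail; it assumes a simple rule (flag any day whose query count sits far from the median of the preceding week), and the function name, window, and threshold are all hypothetical.

```python
from statistics import median

def flag_aberrations(counts, window=7, threshold=3.5):
    """Return indices of counts that deviate sharply from the trailing window.

    Hypothetical rule: compare each value with the median of the previous
    `window` values, scaled by the median absolute deviation (MAD).
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        center = median(history)
        mad = median(abs(x - center) for x in history)
        # 1.4826 makes the MAD comparable to a standard deviation;
        # fall back to 1.0 when the history is completely flat.
        scale = 1.4826 * mad if mad > 0 else 1.0
        if abs(counts[i] - center) / scale > threshold:
            flagged.append(i)
    return flagged

# Made-up daily counts for one query, with a spike on the ninth day.
daily_counts = [100, 102, 98, 101, 99, 103, 100, 101, 250, 102]
suspicious_days = flag_aberrations(daily_counts)
```

A flagged day does not say what went wrong, only that something looks off. That is where the second strategy, people digging for the root cause, takes over.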
The Google approach suggests a certain data vigilantism, composed of smart people wielding smart algorithms to act as sentinels against faulty inference. Big data vigilantism can help your company cope with two of big data's main issues, messiness and sampling bias, and ultimately help grow your confidence in wielding big data in your decision process.
Messiness
At the core of the issue with flu measurement (and most projects involving large amounts of data) is ambiguity, both in the intent of a search query and in the reference rate itself: the CDC measures influenza-like illnesses, which might include non-flu ailments that cause fever, cough, or sore throat. Search terms directly relating to a flu symptom or complication conflate people who actually have the flu with those who are expressing concerned awareness about it, and CDC measurements mingle people who actually have the flu with those who merely show some flu-like symptoms. Trying to determine the actual flu incidence requires some careful disambiguation. This is one place where smarter algorithms may come into data vigilantism: pulling out the information that you actually want to measure from your big, messy pile of data.
Researchers at Johns Hopkins are tackling an even more chaotic source of data for measuring flu: tweets. But by implementing some careful linguistic reverse engineering, their algorithm is able to take into account the context of the tweet and disambiguate the meaning. For instance, they noticed that when the word "flu" was used as the subject in a sentence it suggested that the message was more likely to be about awareness than infection. This pattern, or "template", then gets encoded into the algorithm as something predictive of awareness tweets, along with many other linguistic templates. Using the smarter algorithm their system was able to filter out the awareness messages and focus on the infection tweets, scoring a correlation much closer to that of the CDC.
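The idea of a linguistic template can be illustrated with a toy example. The sketch below is not the Johns Hopkins system; it assumes a handful of hand-written regular expressions stand in for templates, and every pattern and label here is hypothetical. It only shows how "flu" as a sentence subject might be treated as a sign of awareness, while first-person phrasing is treated as a sign of infection.

```python
import re

# Hypothetical templates: first-person statements suggest infection.
INFECTION_TEMPLATES = [
    re.compile(r"\b(?:i|we)\s+(?:have|got|caught)\s+(?:the\s+)?flu\b", re.I),
    re.compile(r"\b(?:my|our)\s+flu\b", re.I),
]

# Hypothetical templates: "flu" used as the subject suggests awareness.
AWARENESS_TEMPLATES = [
    re.compile(r"\b(?:the\s+)?flu\s+(?:is|has|was|will|could)\b", re.I),
    re.compile(r"\bflu\s+(?:season|shot|vaccine|outbreak)\b", re.I),
]

def classify(tweet):
    """Label a tweet as 'infection', 'awareness', or 'unknown'."""
    if any(p.search(tweet) for p in INFECTION_TEMPLATES):
        return "infection"
    if any(p.search(tweet) for p in AWARENESS_TEMPLATES):
        return "awareness"
    return "unknown"

tweets = [
    "I got the flu and feel awful",
    "The flu is spreading fast this year",
    "Nice weather today",
]
labels = [classify(t) for t in tweets]
```

A real system would learn many such templates from labeled examples rather than rely on a few hand-written rules, but the filtering step is the same: keep the infection tweets and set the awareness tweets aside before counting.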
Sampling Bias
Microsoft researcher Kate Crawford points out another pitfall of working with big data: sampling bias. But this again is something that smarter algorithms can help correct for, if you make the effort to understand the bias in your data and adjust for it in the algorithm. For instance, we know from Pew surveys that Twitter usage skews towards younger age demographics. Any flu measurement based on Twitter messages would necessarily carry that demographic bias. To correct for this, we need to know the age of each person sending a flu-related tweet, but thankfully we can estimate that from data too! In fact, researchers at UMass Lowell are already working out the details of integrating age estimates into flu prediction from social media. So with a little bit of investigation to understand the bias of a sample, we can often correct for it downstream with better algorithms.
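One common way to make this kind of correction is to reweight each age group by its share of the real population. The sketch below assumes that approach; it is not the UMass Lowell method, and the age groups, counts, and population shares are invented purely for illustration.

```python
def reweighted_rate(sample_counts, flu_counts, population_share):
    """Estimate the population flu rate from a demographically skewed sample.

    Each group's observed rate is weighted by that group's share of the
    population rather than by its share of the sample.
    """
    total = 0.0
    for group, share in population_share.items():
        group_rate = flu_counts[group] / sample_counts[group]
        total += share * group_rate
    return total

# Invented numbers: the sample over-represents the youngest group.
sample_counts = {"18-29": 600, "30-49": 300, "50+": 100}
flu_counts = {"18-29": 60, "30-49": 15, "50+": 2}
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

raw_rate = sum(flu_counts.values()) / sum(sample_counts.values())
adjusted_rate = reweighted_rate(sample_counts, flu_counts, population_share)
```

With these made-up figures the raw rate is 7.7 percent, while the reweighted rate is about 4.65 percent, because the heavily sampled young group also has the highest rate of flu mentions. The correction is only as good as the age estimates and population shares fed into it.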
The Importance of "Why?" 
Messy, ambiguous data and hidden biases underscore a growing need to hire and train data vigilantes to watch over and ask "why?" about our every interpretation from big data. Big data kitsch promotes a world of blissful ignorance in its focus on correlation without explanation. But the data vigilantes do need to understand "why", sometimes to debug a spurious correlation or systemic failure (like we saw with Google Flu Trends), and other times to be able to develop a smarter method to measure the thing that we really want to measure.
It can be tempting to use data as a crutch in decision-making: "The data says so!" But sometimes the data lets us down and that exciting correlation you found is just a by-product of a messy, biased sample. More advanced algorithms can sometimes help cut through the mess and correct the sample, and smart skeptics can help step back, reflect, and ask if what the data is "saying" actually fits with what you know and expect about the world. Hiring and training these data vigilantes, as well as inculcating a healthy dose of data skepticism throughout your culture and team, can only help bolster the quality of decisions you ultimately make.
