Ideas in Good Currency


Data doesn’t need to be big to be important

In 2009, a team of researchers from Google published a paper in Nature titled Detecting influenza epidemics using search engine query data. Working purely from search engine queries, the researchers claimed to be able to track the spread of influenza across the United States faster and more accurately than the Centers for Disease Control and Prevention (CDC). Google was tracking the outbreak by finding correlations between what people searched for online and the incidence of flu, and then mapping the progress of outbreaks. Google Flu Trends quickly became illustrative of the possibilities and power of ‘Big Data’.

The phrase big data has become so universal in its application that any and all data is now seen as ‘big’. Indeed, if I were to admit that I have small to medium data there is a good possibility that I would be mercilessly mocked. There is so much hype around big data that voices that speak of the need for caution, that argue for more thinking and fewer marketing phrases, and that refuse to believe that scepticism and the scientific method are dead can struggle to be heard. In fact, using our established philosophies and techniques is even more important when confronted by the sea of data available to us today.

The initial failure of Google Flu Trends is also illustrative of the problems with too much hype around Big Data. For instance, the Google modelling was largely theory-free. Without a theory, the researchers were not systematically testing a hypothesis but rather dipping their bucket into the ocean and pulling out whatever happened to be within reach. This was then said to be ‘representative’ of the ocean. The reasoning was that their bucket was so big it must be representative, which underestimates the actual size of the ocean they were messing about in. Additionally, search terms like ‘flu symptoms’ or ‘pharmacies near me’ are not necessarily strongly correlated with the spread of the disease itself. Indeed, we have known for some time that correlation is not causation, a point that big data advocates seem to routinely overlook.
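To make the point concrete, here is a minimal sketch (in Python, using entirely invented data and not Google’s actual method): two series that have nothing to do with one another will often correlate strongly simply because both drift over time, which is exactly the trap a theory-free trawl through millions of search terms can fall into.

```python
# A hedged sketch, not Google's method: two independent random walks will
# often show a strong correlation purely by chance, because both series drift.
import numpy as np

rng = np.random.default_rng(0)
trials, length, strong = 1000, 156, 0  # 156 weekly points, roughly three years

for _ in range(trials):
    a = np.cumsum(rng.normal(size=length))  # e.g. searches for some term
    b = np.cumsum(rng.normal(size=length))  # e.g. reported flu cases
    if abs(np.corrcoef(a, b)[0, 1]) > 0.5:
        strong += 1

print(f"{strong} of {trials} pairs of unrelated series correlate above 0.5")
# A large share of the pairs do, which is why trawling millions of search
# terms for correlations without a theory is all but guaranteed to surface
# spurious 'predictors'.
```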

It turns out that, when compared to what actually happened on the ground, the Google Flu Trends modelling was inaccurate to the extent that relying on it alone would have led the authorities responsible for managing flu outbreaks to waste time and resources. The good people at Google don’t give up that easily, so they have been refining their models and combining them with other data to produce more accurate mapping.

The term big data was originally used to refer to very large data sets, typically those of a size where the manipulation, management and analysis of the data presented significant challenges. Big data is not important because of its size but because of the patterns that emerge from the connections that can be made between disparate pieces of data. And this is where the refined Google Flu Trends data and modelling will eventually be valuable: as one more input in constructing a mosaic of what represents truth in relation to managing public health. The data is important, but alone it does not yield the answer. A person needs to interpret, judge and act on the data for it to become meaningful. Big data attempts to put inert facts at the centre of decision making rather than people. Big mistake.

Also, despite all the hype, not every business has big data. Not every business even needs access to big data. In fact, very few businesses fall into the category where they have, or need, data sets so large that the traditional techniques of statistical analysis are no longer valid. Maybe that’s why a survey reportedly found that 55% of Big Data projects never get completed. They are not relevant.

The popularity of big data has led to some unfortunate conclusions being drawn from (or attached to) the concept; for example:

  • That the sheer quantity of data means any analysis will produce uncannily accurate results.
  • That because we are capturing so much data, those boring old statistical theories and sampling techniques are obsolete (a claim the sketch after this list puts to a simple test).
  • That it is old-fashioned to agonise about what causes what, because statistical correlation tells us what we need to know.
  • That scientific or statistical models aren't needed because, well, ‘with enough data, the numbers speak for themselves’.
  • That a new attitude within organisations means combining data from multiple sources will lead to better decisions.
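
On the second point, a small sketch (again in Python, with an imaginary population and a made-up ‘true’ proportion) shows why sampling theory is anything but obsolete: a simple random sample of 1,000 answers the kind of question most businesses actually ask to within a few percentage points, and collecting a thousand times more data barely improves on it.

```python
# A hedged sketch with invented numbers: classical sampling theory at work.
import numpy as np

rng = np.random.default_rng(1)
# An imaginary population of one million people, about 37% of whom hold some property.
population = rng.random(1_000_000) < 0.37

# A simple random sample of just 1,000 people.
sample = rng.choice(population, size=1_000, replace=False)
estimate = sample.mean()

# Standard 95% margin of error for a proportion estimated from n = 1,000.
margin = 1.96 * np.sqrt(estimate * (1 - estimate) / 1_000)

print(f"True: ~37.0%  Sample estimate: {estimate:.1%} (+/- {margin:.1%})")
# The margin is roughly three percentage points, and it would be essentially
# the same if the population were a hundred million. Sampling is not obsolete.
```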

There is an underlying claim that runs hidden through all these claims: data is objective. And the more data I have, the more objective it must be. Unfortunately, this claim to objectivity is false.

All managers and researchers interpret data. This interpretation begins with the way the data is imagined; it is present in the tools used to gather it; it is present in the analysis; and it penetrates our interpretation. Science has evolved strong methods to control for bias in our search for cause and effect; for example, hypothesis testing, repeatability and falsification.

The historian E.H. Carr made the following observation about facts in history that I think all business leaders and managers should remember.

The facts are really not at all like fish on the fishmonger's slab. They are like fish swimming about in a vast and sometimes inaccessible ocean; and what the historian catches will depend, partly on chance, but mainly on what part of the ocean he chooses to fish in and what tackle he chooses to use – these two factors being, of course, determined by the kind of fish he wants to catch. By and large, the historian will get the kind of facts he wants. History means interpretation.

Leaders and managers are fishing in an astonishingly vast ocean. We should be excited about the possibilities that come with big data, but this does not render all that has gone before meaningless. We should be sceptical of claims that big data will instantly solve problems that have plagued business leaders and managers for all time.

Data is not objective. Humans will always be in the data. 

People still need to form judgements and make decisions, not only from the data that is captured and stored in machines but also from what they know and understand of the way the world works.

Most of all, not all data needs to be big to be important or useful. Small to medium data that is interpreted thoughtfully can still tell us a great deal about the world. In the rush to make everything big in order to create the illusion of objectivity, we are attempting to (not for the first time) remove the human from the machine. 

Using and interpreting data requires people. Businesses seeking to maximise the opportunities of data (most of which has always been available to them) will need to invest time in thinking about what they are trying to achieve with data, information and knowledge. They will also need to commit to investing in the people, techniques and resources required to deliver on their goals.

The philosophies, techniques and tools that we have developed to pursue truth remain valid. Indeed, I think that with the rise of big data, the practices that have served us well in the past are likely to become even more important in maximising the opportunities it might offer.

Thanks for taking the time to read this post.

Sources:

CIOs & Big Data: What your IT team want to know, can be found here.

Google Flu Trends, can be found here.

Should You Be Wary of Big Data Success Stories?, can be found here.

Detecting influenza epidemics using search engine query data, Nature, 457, 1012-1014 (19 February 2009), doi:10.1038/nature07634, also available here.

The big data Wild West: The good, the bad and the ugly, can be found here.

Big data: are we making a big mistake?, can be found here.

Six Provocations for Big Data, can be found here.

Photo credit:

Photo by Eric Constantineau - www.ericconstantineau.com - Creative Commons Attribution-NonCommercial License  https://www.flickr.com/photos/74007022@N00