When you open a book, play a CD, or even watch a VHS tape, you are using a human-made method of storing data. One of humanity's greatest weaknesses has long been the inability to remember, to recall past experiences, ideas, or observations in full. It is our inability to hold large, accessible quantities of information in our own heads that led us to create technologies such as writing, so that we could store the information we need, or that we think others may need in the future. From the Great Library of Alexandria to British trade records to the Library of Congress, data has been compiled throughout human history. Yet only in the past fifty years has data taken on a new meaning in the lives of billions.
Since the advent of the digital age we have measured data in bytes, gigabytes, and beyond. The growth of digital data was slow at first, but it ballooned exponentially and overtook analog data by the end of the twentieth century. With the arrival of ever more powerful computers in the twenty-first century, data has become something we no longer create only directly, but also indirectly.
We are not strangers to this flood of information. For most of us it has become a vital part of our lives: from dawn to dusk we are plugged into a constant stream of data from the world around us.
Yet many of us have only just begun to grasp the sheer volume of data we create daily through the news, tweets, and notifications we deem so vital to our existence. The era of Big Data has begun with these digital footprints we leave every day. In their book “Big Data”, Viktor Mayer-Schonberger and Kenneth Cukier argue that, “In the spirit of Google or Facebook the new thinking is that people are the sum of their social relationships, online interactions, and connections with content. In order to fully investigate an individual, analysts need to look at the widest possible penumbra of data that surrounds the person – not just whom they know, but who those people know too, and so on.” We now are what we do.
The power of modern computing and the extent to which we rely on computers in our daily lives allow for the creation and collection of more data than ever before. From a credit card purchase at Starbucks to the logistics of shipping a UPS package, thousands of bytes of information are collected and stored by different companies and agencies, each with different interests.
The availability of this data is allowing companies like Google to challenge the way the world operates. In 2009 the H1N1 flu outbreak caused a global health crisis, and government agencies needed information fast. Normally, agencies like the U.S. Centers for Disease Control and Prevention (CDC) rely on information reported by doctors, hospitals, and other members of the health industry across the nation and around the world. However, infected individuals often do not go to the doctor's office until their symptoms become severe, so the CDC receives delayed information about the spread of viruses like H1N1.
As it happened, Google had developed a system only weeks earlier that could track the spread of the flu in the U.S., drawing on its vast store of data from billions of daily search queries. The software searched for query terms whose frequency correlated with past data about the spread of the flu, and Google identified 45 search terms that correlated very strongly with the actual flu figures from previous years. So when H1N1 emerged as a crisis, Google was able to give health agencies near real-time data on the spread of the flu, weeks ahead of physicians' reports and governmental data collection. After exploring this example, Mayer-Schonberger and Cukier explain that Google's achievement was “built on ‘big data’ – the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value”. This description captures what big data is: a tangible by-product that we no longer knowingly create, but must learn how to shape and mold.
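To make the mechanics concrete, here is a minimal sketch of how such a correlation search might work, assuming we already have weekly frequencies for each candidate search term and the CDC's historical flu figures for the same weeks (the function names and data layout are hypothetical illustrations, not Google's actual system):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length weekly series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_flu_predictors(term_frequencies, cdc_flu_rates, k=45):
    """Rank candidate search terms by how closely their weekly frequency tracks flu activity.

    term_frequencies: dict mapping a search term to its list of weekly frequencies.
    cdc_flu_rates: list of historical weekly flu figures for the same weeks.
    """
    scores = {term: pearson(freqs, cdc_flu_rates)
              for term, freqs in term_frequencies.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The k best-correlated terms can then be used to estimate current flu activity from current search volumes, which is the essence of what the essay describes.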
Quantity vs. Quality
Data in the past was something that had to be sought out and collected; this was the purpose of tools such as censuses and surveys. The evolution of data collection is vital to examine in order to comprehend the magnitude of the change that big data brings about. In the late 1800s the U.S. Census Bureau faced the growing problem of how long a census took to carry out. The U.S. Constitution requires that a census be taken every decade, yet the 1880 census took eight years to complete, and the 1890 census was projected to take more than a decade, which would have rendered the results obsolete before they were published. This swamp of information that the U.S. government, and companies across the world, were swimming in at the end of the 19th century was an enormous problem, one eased first by tabulating machines and eventually by the advent of computing.
With conventional sampling, the quality of the data is an absolute necessity for analysts. Sampling is what the U.S. census began using to procure some of its data in the early 1900s. New mathematical models for sampling showed that a random sample of only about 1,100 people was needed to achieve a margin of error of roughly two or three percent. This concept has been the basis of statistics and data collection ever since: the idea that a small slice of data could be extrapolated to represent a much larger whole was a breakthrough. Over the decades analysts have developed increasingly sophisticated methods to compensate for sampling error, trying to shrink it as far as possible so that the data becomes more useful.
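For readers curious where a figure like that comes from, this is a minimal sketch of the standard margin-of-error formula for a simple random sample, assuming a 95 percent confidence level and the worst-case proportion of 0.5 (those two assumptions are mine, not the essay's):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a simple random sample of size n at 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

# A sample of about 1,100 respondents gives roughly a 3% margin of error.
print(round(margin_of_error(1100), 3))  # 0.03
```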
However, this idea of absolute data quality is being challenged in the modern era, where the sheer quantity of data can matter more than its quality. Cukier and Mayer-Schonberger's argument runs roughly as follows:
Essentially, when the quantity of data is great enough, the incorrect data points are few and insignificant relative to the correct ones, so the overall error remains small. The process is all about efficiency: we understand that inexactitude exists, but it is often more efficient to analyze the complete, messy data set than to cleanse it first. We have to learn to live with the fact that the data isn't perfect, because at such large quantities the imperfections become insignificant.
For example, to determine inflation rates, government agencies have traditionally contacted companies and stores across the nation to record prices for all kinds of goods and services. This allows them to compute the Consumer Price Index (CPI) from rising or falling prices. The list used to take months to compile, making the data old by the time it appeared, however neat and accurate the sampling process was. The age of the CPI figure often crippled the ability of companies and agencies to act on salaries and other inflation-linked assets.
Hence two economists at the Massachusetts Institute of Technology, Alberto Cavallo and Roberto Rigobon, harnessed the power of big data to produce a “real-time” CPI. By analyzing tens of millions of prices posted on the internet, they were able to create a rough CPI. The process swept in the inconsistencies that abound in online prices, but because the data set was so large, the overall trend of inflation still emerged clearly from their calculations.
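As a rough illustration of the idea, here is a minimal sketch of building a crude daily price index from scraped observations; the data layout and the simple average of price relatives are my assumptions, and the economists' actual methodology is far more sophisticated:

```python
from collections import defaultdict

def daily_price_index(observations, base_date):
    """Crude price index: for each day, average each product's price relative to its base-date price.

    observations: iterable of (product_id, date, price) tuples scraped from the web, possibly noisy.
    Returns a dict mapping date -> index value (base date = 100).
    """
    base_prices = {}
    prices_by_date = defaultdict(dict)
    for product, date, price in observations:
        prices_by_date[date][product] = price
        if date == base_date:
            base_prices[product] = price

    index = {}
    for date, prices in prices_by_date.items():
        ratios = [price / base_prices[prod]
                  for prod, price in prices.items()
                  if prod in base_prices and base_prices[prod] > 0]
        if ratios:
            index[date] = 100 * sum(ratios) / len(ratios)
    return index
```

Individual listings may be wrong or stale, but with tens of millions of observations the averaged trend still tracks inflation, which is exactly the quantity-over-quality point above.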
The messiness that goes hand in hand with quantity also shows up in applications such as the “like” counts on Facebook pages. When the number of likes is small, the count is displayed exactly and increases one like at a time. As the number climbs into the thousands, however, Facebook begins smudging the figures into estimates like 4K, 10K, and 150K. When a page has four thousand likes, each individual like no longer matters much, but large incremental increases still do, so precision is sacrificed at the larger quantity; no one really cares about the difference between 4,000 and 4,010 likes.
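A minimal sketch of that kind of display rounding (this is not Facebook's actual code, just an illustration of the trade-off):

```python
def format_count(n):
    """Round a large count for display, trading exactness for readability."""
    if n < 1_000:
        return str(n)                    # small counts stay exact
    if n < 1_000_000:
        return f"{round(n / 1_000)}K"    # 4,010 likes -> "4K"
    return f"{n / 1_000_000:.1f}M"

print(format_count(4_010))    # 4K
print(format_count(152_600))  # 153K
```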
One of the biggest effects of big data is the way it has reshaped our sense of correlation versus causation, a shift Mayer-Schonberger and Cukier explore at length: correlations reveal what is happening without necessarily explaining why.
This shift has many useful applications in the real world. One example comes from the early years of Amazon's book business. Amazon first recommended additional books based on a customer's purchase, but this yielded mediocre results because it only suggested very similar items. Amazon's engineers then devised a new system of “item-to-item” filtering, in which the computer analyzed the broad database of products, identified similarities between them, and combined that with purchase data. This yielded the system we see today when we visit Amazon. The computer doesn't know why someone buying a desktop computer might also want a monitor, but it makes the suggestion anyway, yielding huge sales. The system evolved to know what to suggest, not why, and we have come to realize that this is the essence of what big data can give us.
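A minimal sketch of item-to-item filtering under simple assumptions, using co-purchase counts and a cosine-style similarity (Amazon's production system is, of course, far more elaborate, and the sample orders below are invented):

```python
import math
from collections import defaultdict
from itertools import combinations

def item_similarities(orders):
    """Build item-to-item similarities from a list of orders (each a set of item ids)."""
    pair_counts = defaultdict(int)   # how often two items were bought together
    item_counts = defaultdict(int)   # how often each item was bought
    for order in orders:
        for item in order:
            item_counts[item] += 1
        for a, b in combinations(sorted(order), 2):
            pair_counts[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), together in pair_counts.items():
        score = together / math.sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = sims[b][a] = score
    return sims

def recommend(item, sims, k=3):
    """Suggest the k items most strongly associated with the given item."""
    neighbours = sims.get(item, {})
    return sorted(neighbours, key=neighbours.get, reverse=True)[:k]

orders = [{"desktop", "monitor"},
          {"desktop", "monitor", "keyboard"},
          {"monitor", "hdmi_cable"}]
print(recommend("desktop", item_similarities(orders)))  # ['monitor', 'keyboard']
```

Nothing in the code knows why a desktop goes with a monitor; it only knows that the two keep showing up together, which is the "what, not why" point the essay makes.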
Cukier and Mayer-Schonberger also examine the case of hospitals, where enormous amounts of monitor and sensor data, every spike and dip in a patient's readings, are routinely thrown away after patients leave or pass away. When computer scientists have collected and analyzed these large data sets instead, they have been able to discover new correlations. In the case of premature babies, sensors capture over 1,260 data points per second about the newborn. By compiling past data, computers have identified correlations between sensor readings and the baby's condition, such as the finding that unusually constant vital signs often precede an infection. Doctors may not know why this occurs, but for the moment it doesn't necessarily matter: the correlation can save thousands of lives before anyone finds out why it happens.
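As a sketch of the kind of analysis involved, suppose "constant vital signs" means unusually low variability in a rolling window of heart-rate readings; the window size and threshold below are purely illustrative assumptions, not values from the book:

```python
import statistics

def flag_low_variability(heart_rates, window=60, threshold=1.5):
    """Return start indices of windows whose heart-rate variability falls below a threshold.

    heart_rates: list of readings, e.g. one per second from a bedside monitor.
    """
    flagged = []
    for start in range(len(heart_rates) - window + 1):
        segment = heart_rates[start:start + window]
        if statistics.stdev(segment) < threshold:   # suspiciously "flat" vital signs
            flagged.append(start)
    return flagged
```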
The Essence of Big Data
“The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.” So wrote Chris Anderson in Wired in 2008.
This new age, Anderson argues, marks the death of the scientific method as we have known it. The scientific method has been the staple not only of science but of every field involving data for most of human history: we first identify a problem and then seek a solution, usually by collecting data. The advent of the “Petabyte Age” means that so much data is already available that only the questions need to be found. The method has been reversed in this strange new world where we don't know what to ask or what to look for. Google processes a petabyte of data every 72 minutes, equivalent to the DNA of the entire U.S. population three times over. At that constant rate Google handles roughly twenty petabytes a day, an exabyte in under two months, and since the volume of data being processed keeps growing exponentially, even those figures will soon look small. This evolving “deluge” of data will increasingly force us, as individuals, to reexamine the way we think.
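A quick back-of-the-envelope check of those figures (the constant-rate assumption is, of course, a simplification):

```python
# Back-of-the-envelope arithmetic for the processing rate quoted above.
minutes_per_day = 24 * 60
pb_per_day = minutes_per_day / 72        # 1 PB every 72 minutes -> 20 PB per day
days_to_exabyte = 1000 / pb_per_day      # 1 exabyte = 1,000 petabytes -> 50 days
print(pb_per_day, days_to_exabyte)       # 20.0 50.0
```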