Is ‘Big Data’ About What We Do With Our Data, Not How Much Of It We Have?
What is it about “big data” that resists definition? Today we have myriad competing definitions that each attempt to circumscribe just what it is we mean when we talk about the idea of using “big data” to understand the world around us. The notion that the size, speed or modality of data warrants such a label falls apart when we recognize that every Google search involves analyzing a 100-petabyte archive using hundreds of query terms. Instead of referring to the size of our datasets, could “big data” refer to the way in which we utilize our data, regardless of its size?
The question of just what constitutes “big data” has become a perennial point of debate in the digital world. Most definitions relate to the characteristics of the data being analyzed, but such definitions become increasingly strained when we recognize that the most mundane of internet tasks, from conducting a Google search to querying Twitter, all involve processing enormous and rapidly growing volumes of multimodal material.
Using the example of a Google search, it seems absurd to label every Web search a “big data analysis” merely because it examined 100 petabytes using hundreds of parameters.
Yet what differentiates a keyword Google search from an SQL query of a data warehouse of the sort that is routinely described as precisely such a big data analysis? Does a keyword search written as an SQL query count as big data where a keyword typed into a Web page does not? Does an SQL-computed histogram count or does it take at least a linear regression?
Does using an SQL query to count the records in a ten-petabyte database qualify as a big data analysis? What about a summation or field extraction?
Where do we draw the line between ordinary data and “big data”? Does that boundary depend on the industry in which we work? To the Googles of the world, petabytes are passé. In the arts, humanities and social sciences, analyses of datasets of hundreds of megabytes are still often described in the literature as “big data,” and in some fields such datasets genuinely are far larger than those ordinarily used.
Does it matter whether we are the ones storing or analyzing that data or whether it is outsourced? If an enterprise manages tens of petabytes of desktop backups in its own data centers, it faces the very real complexities of managing large datasets. At the same time, today there are plenty of vendors that sell turnkey petascale storage systems complete with onsite representatives to manage and service the units. Does that count as big data management? What if a company simply ships its petabytes to the cloud and accesses them using a giant cloud-provided fileserver? Does that count as “big data” if the company itself is not actually doing any of the management?
Similarly, does it count as big data analytics if we use a point-and-click analysis tool that relieves us of the need to write a single line of code? What about a tool like Google’s AutoML that leverages transfer learning and incredibly sophisticated model generation and tuning algorithms to quite literally allow the creation of state-of-the-art deep learning models with a few mouse clicks, with no coding or AI experience necessary? Does using or deploying an AutoML model count as big data even if we didn’t have to write any code ourselves?
Perhaps the answer to what counts as “big data” lies in how we use all of that data.
Using Google to conduct a keyword search implies a human-directed task. A human being has a question, translates that question into a query, enters that query into a search box and peruses the results. Such a workflow hardly seems to justify the big data label.
What if, instead, Google’s algorithms monitored the world’s information on our behalf, searching out the insights and new developments they judge to be of greatest relevance to us and providing us with a real-time summarized digest of the highlights most relevant to our needs at the moment?
The latter sounds far more like a “big data” application than the former, yet both involve searching the exact same dataset.
In fact, in many ways the functional tasks each performs are the same. The difference lies in who performs that analysis: the machine or the human. When a human manually queries a dataset, is it “big data,” or does an analysis require some degree of creative or advanced machine assistance to be worthy of that moniker?
Using the example of an SQL query, if a human manually interrogates a dataset using simplistic queries like counting rows that match different criteria, it seems to strain credulity to call such tasks, which differ little from a keyword search under a different name, “big data.” Alternatively, if a human interrogates that same dataset using more complex queries, like applying machine learning algorithms or complex analytic models, the label would seem to apply more readily.
Putting this all together, perhaps instead of focusing on petabytes or exabytes or trillions of rows, the answer to what constitutes “big data” lies in what we do with all of that data. Simple keyword searches or SQL queries might interrogate exabytes, but it seems unreasonable to classify every Google search as a “big data analysis.” If we instead focus on how that data is used, and in particular on the use of machine creativity to analyze data proactively on our behalf, to surface patterns and trends we were not expecting or to perform complex queries for us, perhaps that might yield a more satisfactory definition.
In the end, shifting our gaze from how much data we hoard to what we actually do with all of that data would go a long way towards moving the field from meaningless marketing buzzword towards genuine business insights.