The Decade of Data Science
It’s been a hell of a decade for data science — Watson dominated Jeopardy (2011), the White House announced the first Chief Data Scientist (2015), and Deepfakes provided all the Nicolas Cage films we ever wanted (2017). Even our discipline’s name, data science, came out of the last 10 years. With 2019 coming to a close, there will inevitably be articles and lists with date-specific events like those above. However, there have also been several important trends that have developed slowly over the last 10 years. When I think of data science now compared to where it was in 2010, three items stand out to me.
1. A Language of Our Own
Data scientists come to the profession from a variety of backgrounds, such as mathematics, astronomy, or medicine. This all lends perspective to our work, but one early drawback of having so many backgrounds was a Tower of Babel of programming languages. The early 2010s saw the beginning of a language war. Paid solutions like Stata, SPSS, SAS, MATLAB, and Mathematica vied with open source tools like R, C++, Python, Julia, Octave, and Gretl.
At first, R looked like it might emerge victorious. Microsoft even made a play by acquiring Revolution Analytics (a company that provides an enhanced version of R). Despite this early lead, 2015 marked a shift towards Python. Python’s impressive library of packages, easy-to-learn syntax, and solid performance all played in its favor.
Fast forward to today. The war is over, new borders are drawn, and Python is the language of choice. This isn’t to say that all other languages are dead. R is still heavily used in statistics, and depending on your industry, you might be using SAS or MATLAB. However, 9 times out of 10, the answer is Python. The benefits of this shared language can’t be overstated. It makes others’ work easier to read and code easier to share.
2. Jupyter Notebooks & GitHub
Search for a tutorial on any data science subject and whatever article you land on will almost certainly contain a link to a Jupyter Notebook hosted on GitHub. This practice is so common it’s essentially an unspoken standard. And rightfully so: it’s a powerful combo. Jupyter Notebooks combine your documentation and code into a single file that makes understanding the what and why much easier. Couple this with GitHub and you not only have a method for explaining your work but also a way of sharing it. GitHub has even taken steps to embrace Jupyter Notebooks by allowing them to be rendered right in the browser.
Where Python gives us a common language, Jupyter and GitHub have given us a common grammar and etiquette of sharing.
3. Proliferation of Datasets
Ten years ago, data was much harder to come by. Even when you did find a dataset on your desired topic, chances were it was small. There was data out there, but it was unorganized, formats were inconsistent, and people were less likely to share what they had collected. Before I had the MNIST or Titanic datasets, I had the “Cars” dataset from Stata. It was a collection of vehicles from 1978 that included features like headroom and fuel efficiency and came with a whopping 74 observations.
Compare that to now, where companies are posting their data for the world to see. In a way, Kaggle has become the Tinder of corporate data. Instead of being hesitant, companies are now enthusiastically sharing data, hoping to get a swipe right from an eager young quant. Local governments have also stepped up their game. Cities like Chicago and New York have made it possible for anyone to get access to data on topics ranging from road conditions to rodent sightings. Resources for searching datasets have also improved with tools like the r/datasets subreddit and Google Dataset Search. If data is the new oil, then every data scientist today probably feels a little like Daniel Day-Lewis in There Will Be Blood.
The Common Thread: Open Science
There’s a through line in the above topics: they each represent a move towards Open Science, the philosophy that tools, methods, and data should be made openly available. Adopting this ideology has, I think, been one of the most profitable investments our community has made in the last decade, aiding our growth, expanding our capabilities, and strengthening the quality of our work.
In terms of growth, the gates to data science are now open to all, in contrast to less than a decade ago when they required a hefty toll of entry. Before Python, SAS was the closest thing to a standard language, and before Jupyter Notebooks there were Wolfram Notebooks. But these solutions weren’t open and free. Depending on the service, you would have been paying between $800 and $8,000 just to run a regression with an “industry standard” tool. Even their websites don’t make it easy for beginners. Visit any of them and you’re presented with dozens of different versions and links asking if you’d like to “get a quote,” making the experience feel less like diving into a dynamic discipline and more like buying a used car.
Commercial software for data science has waned over the last decade. I won’t claim to know the exact reason, but I’ll give my perspective: nothing extinguishes curiosity like a price tag. When I was nearing the end of my undergrad, I remember thinking “I like doing numerical analysis, but there’s no way I can afford Stata. I guess I’d better learn this R thing.” I wasn’t alone: there has been a massive influx of amateurs who are curious and eager to test the waters of data science.
In addition to easy entry, open science has expanded the depth and breadth of everyone’s capabilities. A single data scientist isn’t necessarily a single resource. There’s a massive library of code from contributors around the world. This may take the form of Python packages, Medium tutorials, or Stack Overflow answers. This code corpus makes projects less about building something from scratch and more like mixing and matching Lego sets.
This openness also helps keep our work honest. Along with data science, this decade also brought us the phrase “replication crisis.” While there’s no cure-all for fraudulent work, transparency is a potent preventative measure. One of my favorite articles of the decade, The Scientific Paper Is Obsolete, has a great quote from Wolfram talking about notebooks, saying, “there’s no bullshit there. It is what it is, it does what it does. You don’t get to fudge your data.”
Looking Forward, Looking Back
In some ways this wasn’t just a good decade for data science; it was the decade of data science. We entered the zeitgeist of the 2010s. We were given the title of sexiest profession, we got a Brad Pitt and Jonah Hill movie, and millions saw AlphaGo become a world champion. I feel like this decade was our coming of age. While the next decade will bring with it many improved tools and methodologies, the 2010s will be a time I look back on with a sense of fondness and nostalgia.