The Proliferation of Data Science Tools & Technology
Source – insidebigdata.com
In this special guest feature, Matthew Mahowald, Lead Data Scientist and Software Engineer for Open Data Group, shares his perspective on how the speed at which tech and tools have been developed has caused problems for analytic deployment. Matthew holds a Ph.D. in Mathematics from Northwestern University, with a focus on the geometry of string theory and topological field theory. At Open Data Group, Matthew focuses on developing and deploying machine learning models with FastScore.
The history of predictive analytics might be said to begin with Bayes’ famous theorem relating the conditional probabilities of two events. Even today, the importance of foundational work like Bayes’ theorem cannot be overstated: it is both a foundation for statistical inference across the experimental sciences and a useful tool in its own right for assessing how one event bears on another.
In recent years, as the sultans of Silicon Valley have pressed both computation speeds and data storage capacities to dizzying heights, researchers and analysts working at the intersection of statistics and computer science have leveraged new tools to chase increasingly sophisticated modeling techniques. This dramatic expansion in both software tools and, especially, the quantity and quality of data available led to the emergence of data science as a discipline and, most importantly, of the assets created by data science teams: predictive analytic models.
Historically, however, when it was time to deploy a new predictive analytic model into production, the burden of deployment on IT and the production pipeline was fairly minor. Long lead times meant that each model could be manually restructured (and sometimes even translated into another programming language). Moreover, the comparative simplicity of the models themselves meant that this recoding was not unreasonably labor-intensive.
The proliferation of tools and techniques in data science has not changed the fundamental deployment problem. However, the complexity of the models strains the feasibility of the traditional deployment methodology. There are now more than 10,000 open-source packages on CRAN (the global R package repository). With open-source projects like Scikit-Learn and Pandas, Python offers similarly comprehensive support. Today’s vast data science ecosystem makes it possible to construct a wider variety of models faster, at lower cost, and with more data than ever before.
The same trend shows in the pace at which analytic models are built. What used to be a leisurely process, producing a small number of fairly simple rules-based or linear-regression models each year, has turned into the creation of dozens of complex models leveraging the latest and greatest gradient boosting machine or convolutional neural net toolsets. As a result, the traditional model deployment process has become simply unsustainable.
So, what’s the solution? It’s imperative that everyone – from IT professionals to data scientists – understand and address the challenges of analytic deployment in the modern era. One way to ensure that an enterprise is making analytic deployment a core competency is with an analytic deployment engine. To find success, such an engine would have properties like:
- It is a software component that sits in the production data pipeline, where it receives and executes models.
- It provides native support (without recoding) for any modeling language or package; that is, the engine is language-agnostic.
- It can connect to any data source or sink used in the production data pipeline.
This engine should be simultaneously easy enough to use that the data science team can validate and deploy models without requiring IT involvement, and sufficiently robust and scalable that it can be used with confidence in the production pipeline.
Finally (and most importantly), an analytic deployment engine should be future-proof: new libraries and packages in R and Python shouldn’t require upgrading the engine, nor should the emergence of other new techniques and tools.
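The properties described above can be illustrated with a minimal sketch. It is not FastScore's API or any real product's interface: every name here (`Model`, `Sink`, `CallableModel`, `ListSink`, `DeploymentEngine`) is hypothetical, and the sketch stays within Python, whereas a truly language-agnostic engine would also need adapters for other runtimes such as R. The point is only that the engine depends on a narrow interface, so a new model can be deployed without changing the engine itself.

```python
from typing import Callable, Iterable, List, Optional, Protocol


class Model(Protocol):
    """Hypothetical interface: anything that can score one record."""

    def score(self, record: dict) -> dict: ...


class Sink(Protocol):
    """Hypothetical interface: anything that can accept one scored record."""

    def write(self, record: dict) -> None: ...


class CallableModel:
    """Adapter that wraps a plain function so it satisfies Model."""

    def __init__(self, fn: Callable[[dict], dict]) -> None:
        self._fn = fn

    def score(self, record: dict) -> dict:
        return self._fn(record)


class ListSink:
    """In-memory sink standing in for a real queue, file, or database."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def write(self, record: dict) -> None:
        self.records.append(record)


class DeploymentEngine:
    """Sits between a source and a sink and runs whichever model is deployed."""

    def __init__(self) -> None:
        self._model: Optional[Model] = None

    def deploy(self, model: Model) -> None:
        # Swapping models never requires changing the engine.
        self._model = model

    def run(self, source: Iterable[dict], sink: Sink) -> int:
        """Score every record from the source, write it to the sink, return the count."""
        if self._model is None:
            raise RuntimeError("no model deployed")
        count = 0
        for record in source:
            sink.write(self._model.score(record))
            count += 1
        return count


# Usage: any iterable of records can act as the source.
engine = DeploymentEngine()
sink = ListSink()
engine.deploy(CallableModel(lambda r: {"id": r["id"], "score": 2.0 * r["x"]}))
engine.run([{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}], sink)
```

Because the engine only ever calls `score` and `write`, a model built with a new library, or a new kind of data sink, plugs in through a small adapter rather than an engine upgrade.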
As organizations continue to gather massive data sets and develop more advanced analytic models to extract value, the barriers they encounter continue to pile up. With the right set of data science tools focused on analytic deployment technology, IT and analytics teams can find that sweet spot of success to drive ROI for their businesses.