Building a data science pipeline: Benefits, cautions
Source – techtarget.com
Enterprises are adopting data science pipelines for artificial intelligence, machine learning and plain old statistics. A data science pipeline — a sequence of actions for processing data — will help companies be more competitive in a digital, fast-moving economy.
Before CIOs take this approach, however, it’s important to consider some of the key differences between data science development workflows and traditional application developmentworkflows.
Data science development pipelines used for building predictive and data science models are inherently experimental and don’t always pan out in the same way as other software development processes, such as Agile and DevOps. Because data science models break and lose accuracy in different ways than traditional IT apps do, a data science pipeline needs to be scrutinized to assure the model reflects what the business is hoping to achieve.
At the recent Rev Data Science Leaders Summit in San Francisco, leading experts explored some of these important distinctions, and elaborated on ways that IT leaders can responsibly implement a data science pipeline. Most significantly, data science development pipelines need accountability, transparency and auditability. In addition, CIOs need to implement mechanisms for addressing the degradation of a model over time, or “model drift.” Having the right teams in place in the data science pipeline is also critical: Data science generalists work best in the early stages, while specialists add value to more mature data science processes.
Data science at Moody’s
“As soon as a new model is built, it is at its peak performance, and over time, they get worse,” Grotta said. Declining model performance can have significant impacts. For example, in the finance industry, a model that doesn’t accurately predict mortgage default rates puts a bank in jeopardy.
Watch out for assumptions
Grotta said it is important to keep in mind that data science models are created by and represent the assumptions of the data scientists behind them. Before the 2008 financial crisis, a firm approached Grotta with a new model for predicting the value of mortgage-backed derivatives, he said. When he asked what would happen if the prices of houses went down, the firm responded that the model predicted the market would be fine. But it didn’t have any data to support this. Mistakes like these cost the economy almost $14 trillion by some estimates.
The first line of defense is to encourage the data modelers to be honest about what they do and don’t know and to be clear on the questions they are being asked to solve. “It is not an easy thing for people to do,” Grotta said.
A second line of defense is verification and validation. Model verification involves checking to see that someone implemented the model correctly, and whether mistakes were made while coding it. Model validation, in contrast, is an independent challenge process to help a person developing a model to identify what assumptions went into the data. Ultimately, Grotta said, the only way to know if the modeler’s assumptions are accurate or not is to wait for the future.
A third line of defense is an internal audit or governance process. This involves making the results of these models explainable to front-line business managers. Grotta said he was working with a bank recently that protested its bank managers would not use a model if they didn’t understand what was driving its results. But he said the managers were right to do this. Having a governance process and ensuring information flows up and down the organization is extremely important, Grotta said.
Baking in accountability
Models degrade or “drift” over time, which is part of the reason organizations need to streamline their model development processes. It can take years to craft a new model. “By that time, you might have to go back and rebuild it,” Grotta said. Critical models must be revalidated every year.
To address this challenge, CIOs should think about creating a data science pipeline with an auditable, repeatable and transparent process. This promises to allow organizations to bring the same kind of iterative agility to model development that Agile and DevOps have brought to software development.
Transparent means that upstream and downstream people understand the model drivers. It is repeatable in that someone can repeat the process around creating it. It is auditable in the sense that there is a program in place to think about how to manage the process, take in new information, and get the model through the monitoring process. There are varying levels of this kind of agility today, but Grotta believes it is important for organizations to make it easy to update data science models in order to stay competitive.
How to keep up with model drift
Nick Elprin, CEO and co-founder of Domino Data Lab, a data science platform vendor, agreed that model drift is a problem that must be addressed head on when building a data science development pipeline. In some cases, the drift might be due to changes in the environment, like changing customer preferences or behavior. In other cases, drift could be caused by more adversarial factors. For example, criminals might adopt new strategies for defeating a new fraud detection model.