Avoiding Garbage in Machine Learning
“Garbage in, garbage out” – it’s a cliché in machine learning circles. Anyone who works with artificial intelligence (AI) knows that the quality of the data goes a long way toward determining the quality of the result. But “garbage” is a broad and expanding category in data science: poorly labeled or inaccurate data, data that reflects underlying human prejudices, incomplete data. To paraphrase Tolstoy, great datasets are all alike, but every garbage dataset is garbage in its own unique and horrible way.
People believe in machine learning. Israeli philosopher and historian Yuval Noah Harari coined the term “dataism” to describe a blind faith in algorithms. This faith extends beyond machine learning’s ability to analyze data: many people believe machine learning is auto-magically able to predict the future. It isn’t. Machine learning is excellent at identifying patterns in large, well-labeled datasets. In some cases, those patterns will continue to unfold in the future; in others, they won’t. Because the public perceives AI as predictive, the machine-learning researcher must assume responsibility for that perception. Mitigating the effects of garbage data becomes a moral imperative when your algorithm is being used to make parole decisions or to invest billions in hard-earned pension dollars.
Faith in machine learning isn’t entirely misplaced. Machine learning is considered more objective than other methods of data analysis because it is demonstrably superior: it can produce models that analyze larger, more complex datasets than conventional methods and deliver more accurate results, all at scale. Used properly, machine learning can help companies identify profitable opportunities and avoid risks. But even a technology this advanced can’t perform the alchemy of turning garbage into gold. Learning to identify “garbage” data is the first step in unlocking machine learning’s vast potential.
“Machine learning is not magic,” cautions Nick Cafferillo, chief data and technology officer at S&P Global. “It’s about augmenting our thought processes to help prove – or disprove – a hypothesis we have; a hypothesis that is founded on real business questions that we understand in the abstract.”
Garbage data is a particular concern in machine learning because there are two opportunities for garbage to corrupt your results. First, if you train your model on garbage data, you bake the bad data into the underlying algorithm: feed good data into a neural network trained on garbage, and your results may still be inaccurate. Alternatively, you could train your neural network on great data and then run garbage data through the well-trained model. Either way, your output will be questionable.
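A toy example makes the two failure modes concrete. The sketch below is purely illustrative – the one-dimensional nearest-centroid classifier and all the numbers are my own, not from the article. It trains the classifier twice, once on clean labels and once with a single mislabeled training point, then feeds a corrupted input to the cleanly trained model:

```python
# Illustrative toy example (assumed setup, not from the article):
# a 1-D nearest-centroid classifier demonstrating both failure modes.

def centroid(xs):
    return sum(xs) / len(xs)

def predict(c0, c1, x):
    # Assign x to whichever class centroid is closer.
    return 0 if abs(x - c0) <= abs(x - c1) else 1

# Clean training data: class 0 clusters near 1, class 1 clusters near 9.
c0_clean = centroid([0, 1, 2])      # 1.0
c1_clean = centroid([8, 9, 10])     # 9.0

# Failure mode 1: garbage at training time. Suppose the point 2 was
# mislabeled as class 1, dragging that centroid toward class 0.
c0_noisy = centroid([0, 1])         # 0.5
c1_noisy = centroid([2, 8, 9, 10])  # 7.25

test = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (8, 1), (9, 1), (10, 1)]

acc_clean = sum(predict(c0_clean, c1_clean, x) == y for x, y in test) / len(test)
acc_noisy = sum(predict(c0_noisy, c1_noisy, x) == y for x, y in test) / len(test)
print(acc_clean)  # 1.0   -- the clean model classifies every test point correctly
print(acc_noisy)  # 0.875 -- one mislabeled training point now misclassifies x=4

# Failure mode 2: garbage at inference time. The well-trained model, fed a
# corrupted input (say, a sensor sign flip turning 9 into -9), still answers
# confidently -- and wrongly.
print(predict(c0_clean, c1_clean, -9))  # 0, though the true reading 9 is class 1
```

Note that in the second case the model itself is fine; no amount of careful training protects you from feeding it garbage at prediction time.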