Selecting and Preparing Data for Machine Learning Projects
Data is the foundation of any machine learning model. Indeed, there are similarities between the data required for machine learning and any other data-centric project. In all kinds of projects, senior executives need to undertake proper levels of diligence to ensure that the data is reliable, consistent, and comprehensive. However, some data concerns are specific to machine learning. When engaging with data in machine learning projects, it helps to consider:
- How much data is needed?
- Is there potential for cross-contamination of data?
- Is there bias in the data?
- How is non-numeric data treated?
How Much Data Is Needed?
Every machine learning problem is unique, and the amount of data required depends on the complexity of the exercise and the quality of the data. Even so, the answer is often “less than you think.”
Although the term “machine learning” is often paired with the term “big data,” machine learning can also be applied to data sets numbering in the thousands or even hundreds.
To test this, we applied common supervised machine learning algorithms to a data set in which 30 separate inputs were weighted at random, so as not to favor one input over another, and then used to generate an output. A human analyst would never accurately predict an outcome on this randomly weighted data set. However, many of the machine learning algorithms predicted the outcome with greater than 90% accuracy after 4,000 observations. Big data is not necessary for machine learning to be useful.
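An experiment of this kind can be sketched in a few lines. The version below is only an illustration under stated assumptions, not the setup used above: it assumes the output is a yes/no label obtained by thresholding a randomly weighted sum of 30 normally distributed inputs, and it uses a hand-written logistic regression as a stand-in for the "common supervised algorithms." The function names, sample sizes, learning rate, and epoch count are all hypothetical choices.

```python
import math
import random


def make_data(n_rows, weights, rng):
    """Draw standard-normal inputs; the label is 1 when the weighted sum is positive."""
    rows, labels = [], []
    for _ in range(n_rows):
        x = [rng.gauss(0.0, 1.0) for _ in weights]
        score = sum(w * v for w, v in zip(weights, x))
        rows.append(x)
        labels.append(1 if score > 0 else 0)
    return rows, labels


def train_logistic(rows, labels, epochs=10, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    coef = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(c * v for c, v in zip(coef, x))
            z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
            err = y - 1.0 / (1.0 + math.exp(-z))
            bias += lr * err
            coef = [c + lr * err * v for c, v in zip(coef, x)]
    return coef, bias


def accuracy(coef, bias, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = 0
    for x, y in zip(rows, labels):
        pred = 1 if bias + sum(c * v for c, v in zip(coef, x)) > 0 else 0
        hits += pred == y
    return hits / len(rows)


def learning_curve(sizes=(100, 1000, 4000), n_inputs=30, seed=0):
    """Train on increasingly large samples and score each model on one held-out set."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]  # assumed random weighting
    test_rows, test_labels = make_data(2000, weights, rng)
    results = {}
    for n in sizes:
        rows, labels = make_data(n, weights, rng)
        coef, bias = train_logistic(rows, labels)
        results[n] = accuracy(coef, bias, test_rows, test_labels)
    return results


if __name__ == "__main__":
    for n, acc in learning_curve().items():
        print(f"{n:>5} training rows -> held-out accuracy {acc:.3f}")
```

Running the script prints held-out accuracy at each sample size, which makes it easy to see how quickly accuracy levels off as observations are added. The exact figures depend on the assumptions above and should not be read as a reproduction of the results quoted in this article.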
Potential for “Cross Contamination”
When training a machine learning model, data is divided into training and testing sets. The algorithm optimizes its predictions on the training set before using the testing set to determine its accuracy. It is essential to be careful that the data in one set doesn’t contaminate the other set.
Dividing the data based on random selection can create problems if the data set has multiple observations of the same input over time. For example, suppose a retail company wanted to build a store profitability predictor using monthly observations of profitability for all locations over the last five years. Randomly splitting the data would result in both the training and testing sets including observations of the same store.
In that scenario, even if we eliminated store IDs from the data, machine learning algorithms would still be able to identify which store was which and accurately predict profitability by store. The algorithm might begin predicting profitability based on which store an observation came from and not on the other factors on which we were hoping to gain insight. The test vs. train results would reflect artificially high accuracy due to the cross-contamination of data.
We can avoid this problem by ensuring that we explicitly bifurcate the training and testing sets. In the example above, we could randomly assign the stores to the training set or testing set with no overlap between the two, as opposed to randomly assigning the monthly observations. That would result in more reliable predictions that provide insight into the factors in which we are interested.
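The store-level split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_by_store`, the field names, and the toy data (10 stores with 60 monthly observations each) are hypothetical.

```python
import random


def split_by_store(observations, test_fraction=0.2, seed=0):
    """Assign whole stores, not individual monthly rows, to the train or test set."""
    store_ids = sorted({row["store_id"] for row in observations})
    rng = random.Random(seed)
    rng.shuffle(store_ids)
    n_test = max(1, round(len(store_ids) * test_fraction))
    test_stores = set(store_ids[:n_test])
    train = [row for row in observations if row["store_id"] not in test_stores]
    test = [row for row in observations if row["store_id"] in test_stores]
    return train, test


# Hypothetical data: 10 stores, 60 monthly observations each (five years).
observations = [
    {"store_id": s, "month": m, "profit": 1000.0 + 10 * s + m}
    for s in range(10)
    for m in range(60)
]

train, test = split_by_store(observations)
train_stores = {row["store_id"] for row in train}
test_stores = {row["store_id"] for row in test}

print(len(train), len(test))       # 480 120
print(train_stores & test_stores)  # set(), so no store appears in both sets
```

The store ID is used here only to decide which set a row belongs to; it would normally be dropped before training. Libraries offer the same idea ready-made, for example `GroupShuffleSplit` in scikit-learn, which splits on a group label rather than on individual rows.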
Is There Bias?
A key benefit of machine learning algorithms is that they do not apply the heuristics and biases prevalent in human decision-making. Algorithms use only the data and features provided to develop an optimal method for making predictions. The flip side is that if there is bias in the data, the algorithms won’t be able to reverse or rectify it.
That fact was famously evident when an audit of a machine learning-driven résumé screening company found that “good job candidates” were most likely to (1) be named Jared and (2) have played high school lacrosse.
Those constructing the algorithm in question probably assumed that by omitting factors such as race, gender, or background, they were creating an unbiased model. However, the data that was used still contained implicit biases (all lacrosse-playing Jareds get selected to the exclusion of other good candidates), thus resulting in an inexcusably biased output. The ratings of prior candidates’ performances were biased because they were made by people of a specific race and background, which resulted in biased outcomes from the algorithm.
In this example, the candidates’ backgrounds (factors), together with their rankings (outcome), were used to predict rankings for future candidates. When asking the algorithm to predict future rankings, you should consider whether the historic rankings in the data set are biased, as they were in this case. If outcomes are based on human bias, the machine will replicate that bias in its predictions. Here, the client asked to see the features being weighted and noticed the bias. Note that the screening company didn’t catch this; it took the experience of senior executives to spot it.
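One simple way to start checking historic outcomes for this kind of skew is to compare how often each group received a favorable rating. The sketch below is illustrative only: the records are made-up toy data, and the function and field names (`positive_rate_by_value`, `played_lacrosse`, `rated_good`) are hypothetical. A large gap does not prove bias on its own, but it flags a feature that deserves a closer look before the data is used for training.

```python
from collections import defaultdict


def positive_rate_by_value(records, feature, outcome="rated_good"):
    """Share of records with a favorable outcome, for each value of one feature."""
    counts = defaultdict(lambda: [0, 0])  # value -> [favorable, total]
    for rec in records:
        tally = counts[rec[feature]]
        tally[0] += 1 if rec[outcome] else 0
        tally[1] += 1
    return {value: pos / total for value, (pos, total) in counts.items()}


# Made-up toy records, for illustration only.
records = [
    {"played_lacrosse": True, "rated_good": True},
    {"played_lacrosse": True, "rated_good": True},
    {"played_lacrosse": True, "rated_good": False},
    {"played_lacrosse": False, "rated_good": True},
    {"played_lacrosse": False, "rated_good": False},
    {"played_lacrosse": False, "rated_good": False},
    {"played_lacrosse": False, "rated_good": False},
]

rates = positive_rate_by_value(records, "played_lacrosse")
for value, rate in rates.items():
    print(f"played_lacrosse={value}: {rate:.2f} rated good")
```

In this toy data, two of three lacrosse players were rated good versus one of four others, the kind of gap a reviewer would want explained.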
Treatment of Non-numeric Data
When developing a supervised machine learning algorithm, the data must be numerical. For quantitative measures, like revenue or profit, this poses no problems.
However, most projects require interpretation of non-numeric data, and transforming text or labels into numbers carelessly can lead to pitfalls. For instance, analysts may convert company sectors to index numbers based on alphabetical order. This approach may be easy to implement, but it could, for example, place “consumer staples” right next to “energy,” which can lead algorithms to treat the two sectors as similar.
There are several ways to convert non-numeric data, such as vectorizing text (transforming text labels and their frequencies into numbers a machine can understand) or simply being more intentional about how to order lists. As a stakeholder, you and your team should consider the appropriate options and explore how each one affects the accuracy of results.
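To make the alphabetical-index pitfall concrete, the sketch below encodes a short, hypothetical list of sectors two ways. The second approach, one-hot encoding, is not named in this article; it is shown here as one common option, under the assumption that the categories have no natural order.

```python
# Hypothetical sector labels for four companies.
sectors = ["energy", "consumer staples", "technology", "energy"]

# Alphabetical indexing: adjacent codes imply a closeness that does not exist.
ordered = sorted(set(sectors))
index_codes = [ordered.index(s) for s in sectors]

# One-hot encoding: one 0/1 column per category, so no sector is "next to" another.
one_hot = [[1 if s == category else 0 for category in ordered] for s in sectors]

print(ordered)      # ['consumer staples', 'energy', 'technology']
print(index_codes)  # [1, 0, 2, 1]
print(one_hot)      # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

With index codes, "consumer staples" (0) and "energy" (1) sit one step apart, and an algorithm may read meaning into that distance. With one-hot columns, every pair of sectors is equally far apart, at the cost of one extra column per category.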
Machine learning models are only as good as the data underlying them. Given their experience and perspective, senior stakeholders can add value to data science teams, especially in identifying bias and contamination. Having data scientists work alongside senior executives as they consider the quantity, quality, bias, and contamination of data is the best practice for implementing machine learning models successfully.