How to create a data set for machine learning with limited data
Most companies remain in the research and development phase of AI implementation, and one reason why few have actual AI deployments is that data science teams are facing data shortages. Analysts agree that the more data you have, the better trained your models will be. So how does a data shortage factor in when determining how to create a data set for machine learning? The solution may be to look for data in unique places and pull from research and prior collection.
At the recent AI World Conference & Expo, data scientist Madhu Bhattacharyya, managing director of enterprise data and analytics at global consultancy firm Protiviti, talked about internal data shortages, mediating bias and the importance of external data collection.
What are some tips for how to create a data set for machine learning if you have limited internal data?
Madhu Bhattacharyya: In reality, the more data you have, the better the model is because you can check for seasonality, you can check for factors that become inherent to the model when you’re building it. From a prediction perspective, accuracy also increases with more data.
So if the data you have is very lean, or you’re a company that doesn’t have enough data but wants to come up with insights, you need to figure out a way to get there, whether through analytics, analysis, data multiplication or data mining exercises.
Say you’re a startup, or you’re just developing a new product. There will be some data which will be available right away, because before you start up with something, you do a lot of research. Nothing starts off out of the blue. Before releasing any product or service, think of what you do that collects data. You check for viability, you check for market penetration, you check for potential ROI.
If you’re selling a product, a platform as a service or a service, even before you generate your own data, you will have the initial market data that you researched. How did you identify your potential customer? How did you decide that you need to have the launch in Boston versus in Dallas, for example? All of the information that helped you strategize from multiple angles before the launch of the product is useful for building models and creating a data pipeline.
Don’t restrict yourself only to internal data. Try and bring in relevant external data. Ideally, you want a huge amount of data to fall back on. An amalgamation of both internal and external data makes models and AI training much more robust from a decision-making perspective.
As data scientists, we need to check for data bias at the very outset, when we are actually bringing in the data. Check the data. Do data cleansing to check for data quality. Make sure that your data is not replicated: each line item should be unique and should not appear multiple times. Check for variable reduction to make sure that you have the right set of data. The intent is to reduce bias in the output of the model.
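The uniqueness check described above can be sketched in a few lines of Python. This is only an illustration, not a method from the interview: the `drop_duplicate_rows` helper, the `order_id` key and the sample records are all hypothetical, and a real pipeline would choose its own key fields.

```python
def drop_duplicate_rows(rows, key_fields):
    """Keep the first occurrence of each record, judged by key_fields."""
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from the fields that identify a line item.
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical line items; order 1001 appears twice.
orders = [
    {"order_id": 1001, "city": "Boston", "amount": 250.0},
    {"order_id": 1002, "city": "Dallas", "amount": 90.0},
    {"order_id": 1001, "city": "Boston", "amount": 250.0},
]

clean_orders = drop_duplicate_rows(orders, key_fields=("order_id",))
```

Here `clean_orders` keeps one record per `order_id`, in the original order. Which fields make a line item "the same" is a judgment call that depends on the data set.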
Most companies think that when you get external data, you don’t have control over it, but when you buy or acquire data, there is an expectation that the data is unbiased and clean.
Then you have your own internal data that you work with, and that is where you can actually have your data quality checks in place. When you try to build analytical models using both internal and external data, the first thing to do is look at the data that you want to use for the model and check for multicollinearity. If there are five variables that are interdependent (which means they are correlated, and the presence of one would mean the presence of the others), then we keep one and remove the others because we don’t want to bring in that bias.
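One rough way to act on the "keep one, remove the others" idea is to compare variables pairwise and drop any that is strongly correlated with one already kept. The sketch below assumes a Pearson correlation and a cutoff of 0.9; neither comes from the interview, and the function names and sample columns are hypothetical. Pairwise correlation is also a simplification, since it will not catch a variable that depends on a combination of several others.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.9):
    """Return column names to keep, skipping any column that is strongly
    correlated with a column already kept (threshold is an assumption)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical variables: the second is the first in different units.
features = {
    "ad_spend": [10, 20, 30, 40, 50],
    "ad_spend_cents": [1000, 2000, 3000, 4000, 5000],
    "store_visits": [7, 3, 9, 2, 6],
}

kept = prune_correlated(features)
```

In this example `ad_spend_cents` is dropped because it carries the same information as `ad_spend`, while `store_visits` is kept. The variable that survives is simply the one listed first, so in practice the ordering should reflect which variable is most meaningful to the business objective.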
In talking about data quantity, you don’t have to work with every single variable in the data set. Bring it down to whatever is relevant and significant for that particular model or for that particular business objective. Bring in whatever clean data you have and determine what model building you can perform with your existing data and the external data that you have. With that concept, we can actually build models or algorithms, while doing more analysis and data mining, to come up with some insights.