<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Development Complexity Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/development-complexity/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/development-complexity/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Thu, 31 Oct 2019 07:32:40 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>How to create a data set for machine learning with limited data</title>
		<link>https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/</link>
					<comments>https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 31 Oct 2019 07:32:38 +0000</pubDate>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[data analytics]]></category>
		<category><![CDATA[Development Complexity]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Research]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=4932</guid>

					<description><![CDATA[<p>Source: Most companies remain in the research and development phase of AI implementation, and one reason why few have actual AI deployments is that data science teams <a class="read-more-link" href="https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/">How to create a data set for machine learning with limited data</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: </p>



<p>Most companies remain in the research and development phase of AI implementation, and one reason why few have actual AI deployments is that data science teams are facing data shortages. Analysts agree that the more data you have, the better trained your models will be. So how does a data shortage factor in when determining how to create a data set for machine learning? The solution may be to look for data in unique places and pull from research and prior collection.</p>



<p>At the recent AI World Conference &amp; Expo, data scientist Madhu Bhattacharyya, managing director of enterprise data and analytics at global consultancy firm Protiviti, talked about internal data shortages, mitigating bias and the importance of external data collection.</p>



<p><strong>What are some tips for how to create a data set for machine learning if you have limited internal data?</strong></p>



<p>Madhu Bhattacharyya: In reality, the more data you have, the better the model is because you can check for seasonality, you can check for factors that become inherent to the model when you&#8217;re building it. From a prediction perspective, accuracy also increases with more data.</p>



<p>So if the data you have is very lean, or you&#8217;re a company that doesn&#8217;t have enough data but wants to come up with insights, you need to find a way to do so &#8212; through analytics, analysis, data multiplication or data mining exercises.</p>



<p>Say you&#8217;re a startup, or you&#8217;re just developing a new product. There will be some data which will be available right away, because before you start up with something, you do a lot of research. Nothing starts off out of the blue. Before releasing any product or service, think of what you do that collects data. You check for viability, you check for market penetration, you check for potential ROI.</p>



<p>If you&#8217;re selling a product, a platform as a service or a service, even before you generate your own data, you will have the initial market data that you researched. How did you identify your potential customer? How did you decide to launch in Boston versus Dallas, for example? All of that information that helped you strategize from multiple angles before the launch of the product is useful for building models and creating a data pipeline.</p>



<p>Don&#8217;t restrict yourself only to internal data. Try and bring in relevant external data. Ideally, you want a huge amount of data to fall back on from an amalgamation of both internal and external data that actually makes models and AI training much more robust from a decision-making perspective.</p>
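<p>In practice, that amalgamation is often just a join of lean internal records with purchased or public data on a shared key. A minimal, library-free sketch (all field names and figures below are hypothetical, not from the interview):</p>

```python
# Hypothetical example: enriching lean internal records with external
# market data keyed on a shared field (here, "region"). Every field
# name and number is illustrative only.
internal = [
    {"customer_id": 1, "region": "Boston", "purchases": 3},
    {"customer_id": 2, "region": "Dallas", "purchases": 1},
]
external = {  # e.g., purchased market-research data, keyed by region
    "Boston": {"median_income": 86000, "market_penetration": 0.12},
    "Dallas": {"median_income": 70000, "market_penetration": 0.07},
}

def enrich(rows, lookup):
    """Join external attributes onto each internal record by region."""
    return [{**row, **lookup.get(row["region"], {})} for row in rows]

training_rows = enrich(internal, external)
```

<p>Each training row now carries both the internally collected signal (purchases) and the external context (income, penetration), which is the kind of wider data set the models can fall back on.</p>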



<p>As data scientists, we need to check for data bias, even at the very outset when you are actually bringing in the data. Check the data. Do data cleansing to check for data quality. Make sure that your data is not replicated or that you don&#8217;t have the same line item multiple times and it is unique. Check for variable reduction to make sure that you have the right set of data. The intent is to reduce bias in the output of the model.</p>
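<p>A cleansing pass of the kind described above &#8212; dropping replicated line items and records that fail a quality check &#8212; can be sketched in a few lines (field names are hypothetical):</p>

```python
# Hypothetical cleansing pass: drop exact duplicate line items and
# records with missing required fields. Field names are illustrative.
raw = [
    {"id": "a1", "amount": 10.0},
    {"id": "a1", "amount": 10.0},   # replicated line item
    {"id": "b2", "amount": None},   # fails the quality check
    {"id": "c3", "amount": 7.5},
]

def cleanse(rows, required=("id", "amount")):
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:                                 # duplicate line item
            continue
        if any(row.get(f) is None for f in required):   # missing data
            continue
        seen.add(key)
        clean.append(row)
    return clean

clean_rows = cleanse(raw)  # only the unique, complete rows survive
```

<p>Real pipelines would layer on type checks and variable reduction, but the intent is the same: ensure each line item is unique and complete before it reaches the model.</p>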



<p>Most companies think that when you get external data, you don&#8217;t have control over it, but when you buy or acquire data, there is an expectation that the data is unbiased and clean.</p>



<p>Then you have your own internal data that you work with, and that is where you can actually have your data quality checks in place. When you try to build analytical models using both internal and external data, the first thing to look at is the data that you want to use for the model, and check for multicollinearity. If there are five variables which are interdependent &#8212; which means they are correlated and the presence of one would imply the presence of the others &#8212; then we keep one and remove the others, because we don&#8217;t want to bring in that bias.</p>
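<p>The keep-one-drop-the-rest step can be illustrated with a simple greedy pass over pairwise Pearson correlations (the variable names and the 0.95 threshold are illustrative assumptions, not from the interview):</p>

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(columns, threshold=0.95):
    """Greedily keep one variable from each highly correlated group."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: "spend_eur" is the same signal as "spend",
# so only one of the pair should survive.
columns = {
    "spend":     [1, 2, 3, 4],
    "spend_eur": [1.1, 2.2, 3.3, 4.4],
    "age":       [30, 25, 40, 22],
}
kept = drop_correlated(columns)  # drops "spend_eur", keeps the rest
```

<p>Production variable reduction would typically use variance inflation factors or regularization rather than a raw threshold, but the bias-reduction intent is the same.</p>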



<p>In talking about data quantity, you don&#8217;t have to work with every single variable in the data set. Bring it down to whatever is relevant and significant for that particular model or that particular business objective. Bring in whatever clean data you have and figure out what model building you can perform with your existing data and the external data that you have. With that, we can actually build models or algorithms, while doing more analysis and data mining, to come up with insights.</p>
<p>The post <a href="https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/">How to create a data set for machine learning with limited data</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/how-to-create-a-data-set-for-machine-learning-with-limited-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Managing Deep Learning Development Complexity</title>
		<link>https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/</link>
					<comments>https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/#comments</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Wed, 02 Aug 2017 07:15:16 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[continuous learning]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[Development Complexity]]></category>
		<category><![CDATA[multimedia application]]></category>
		<category><![CDATA[TFLearn]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=418</guid>

					<description><![CDATA[<p>Source &#8211; nextplatform.com For developers, deep learning systems are becoming more interactive and complex. From the building of more malleable datasets that can be iteratively augmented, to more <a class="read-more-link" href="https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/">Managing Deep Learning Development Complexity</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source &#8211; nextplatform.com</p>
<p>For developers, deep learning systems are becoming more interactive and complex. From the building of more malleable datasets that can be iteratively augmented, to more dynamic models, to more continuous learning being built into neural networks, there is a greater need to manage the process from start to finish with lightweight tools.</p>
<p>“New training samples, human insights, and operation experiences can consistently emerge even after deployment. The ability of updating a model and tracking its changes thus becomes necessary,” says a team from Imperial College London that has developed a library to manage the iterations deep learning developers make across complex projects. “Developers have to spend massive development cycles on integrating components for building neural networks, managing model lifecycles, organizing data, and adjusting system parallelism.”</p>
<p>To better manage development, the team developed TensorLayer, an integrated development approach via a versatile Python library where all elements (operations, model lifecycles, parallel computation, failures) are abstracted in a modular format. These modules include one for managing neural network layers, another for models and their lifecycles, yet another to manage the dataset by providing a unified representation for all training data across all systems, and finally, a workflow module that addresses fault tolerance. As the name implies, TensorFlow is the core platform for training and inference, which feeds into MongoDB for storage—a common setup for deep learning research shops.</p>
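<p>The four-module decomposition described above can be pictured schematically. The sketch below is a library-agnostic illustration of the split; the class and method names are invented for this example and are not the actual TensorLayer API:</p>

```python
# Schematic of the four-module split (layer, network, dataset, workflow).
# All names here are illustrative, NOT the real TensorLayer interfaces.
class LayerModule:
    """Provides reusable layer implementations."""
    def dense(self, units):
        return {"type": "dense", "units": units}

class NetworkModule:
    """Builds networks and tracks model versions across the lifecycle."""
    def __init__(self):
        self.layers, self.versions = [], []
    def add(self, layer):
        self.layers.append(layer)
    def checkpoint(self, tag):              # lifecycle tracking
        self.versions.append((tag, len(self.layers)))

class DatasetModule:
    """Unified representation of training data (backed by e.g. MongoDB)."""
    def __init__(self, records):
        self.records = list(records)

class WorkflowModule:
    """Training plan with simple fault tolerance (retry on failure)."""
    def run(self, step, retries=2):
        for attempt in range(retries + 1):
            try:
                return step()
            except RuntimeError:
                if attempt == retries:
                    raise

# Usage: compose the modules the way a training plan would.
net = NetworkModule()
net.add(LayerModule().dense(800))
net.checkpoint("v1")
result = WorkflowModule().run(lambda: "trained")
```

<p>The point of the decomposition is that each concern (layers, lifecycle, data representation, fault-tolerant execution) can evolve independently while the core engine stays TensorFlow.</p>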
<div id="attachment_34880" class="wp-caption aligncenter"><img fetchpriority="high" decoding="async" class="size-full wp-image-34880" src="https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/TensorLayer.png" sizes="(max-width: 754px) 100vw, 754px" srcset="https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/TensorLayer.png 754w, https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/TensorLayer-300x170.png 300w, https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/TensorLayer-200x113.png 200w, https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/TensorLayer-500x283.png 500w" alt="" width="754" height="427" />
<p class="wp-caption-text">A deep learning developer writes a multimedia application with the help of functions from TensorLayer. These functions range from providing and importing layer implementations, to building neural networks, to managing model life-cycles, to creating online or offline datasets, and to writing training plans. These functions are grouped into four modules: layer, network, dataset, and workflow.</p>
</div>
<p>The team says that while existing tools like Keras and TFLearn are useful, they are not as extensible as they need to be as networks become more complex and iterative. Such tools provide imperative abstractions to lower the adoption barrier, but in turn mask the underlying engine from users. Though good for bootstrapping, they become hard to tune and modify from the bottom up, which is often necessary in tackling real-world problems.</p>
<p>Compared with Keras and TFLearn, TensorLayer provides not only the high-level abstraction, but also an end-to-end workflow including data pre-processing, training, post-processing, serving modules and database management, all of which are key for developers building the entire system.</p>
<p>TensorLayer advocates a more flexible and composable paradigm: neural network libraries can be used interchangeably with the native engine. This allows users to tap into the ease of pre-built modules without losing visibility. This noninvasive nature also makes it viable to consolidate with other TensorFlow wrappers such as TF-Slim and Keras. However, the team argues, this flexibility does not sacrifice performance.</p>
<p>There are a number of applications the team highlights in the full paper, which also provides details about each of the modules, the overall architecture, and current developments. The applications include generative adversarial networks, deep reinforcement learning, and hyperparameter tuning in an end-user context. TensorLayer has also been used for multi-modal research, image transformation, and medical signal processing since its GitHub release last year.</p>
<p>TensorLayer is in an active development stage and has received numerous contributions from an open community. It has been widely used by researchers from Imperial College London, Carnegie Mellon University, Stanford University, Tsinghua University, UCLA and Linköping University, as well as engineers from Google, Microsoft, Alibaba, Tencent, ReFULE4, Bloomberg and many others.</p>
<p>The post <a href="https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/">Managing Deep Learning Development Complexity</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/managing-deep-learning-development-complexity/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
