Data Mining Tool Could Help Train Machine Learning Models
December 24, 2019 – Researchers at Purdue University have created a new framework for mining data to train machine learning models used in drug development.
Using machine learning for drug development requires researchers to create a process for the computer to extract needed information from a pool of data points. Drug scientists have to pull biological data and train the software to understand how a typical human body will interact with the combinations used to form medications.
Drug discovery researchers developed the Lemon framework, which helps drug researchers better mine the Protein Data Base (PDB), a comprehensive resource with more than 140,000 biomolecular structures and with new ones released every week.
“PDB is an essential tool for the drug discovery community,” said Gaurav Chopra, an assistant professor of analytical and physical chemistry in Purdue’s College of Science who works with other researchers in the Purdue Institute for Drug Discovery and led the team that created Lemon.
“The problem is that it can take an enormous amount of time to sort through all the accumulated data. Machine learning can help, but you still need a strong framework from which the computer can quickly analyze data to help in the creation of safe and effective drugs.”
The Lemon software platform mines the PDB within minutes. The platform also allows users to write custom functions, include it as part of their software suite, and develop custom functions in a standard manner to generate unique benchmarking datasets for the entire scientific community.
The Lemon platform was originally designed to create benchmarking sets for drug design software and identify the lemons, or the biomolecular interactions that can’t be modeled well, in the PDB.
“Experimental structures deposited in PDB have resulted in several advances for structural and computational biology scientific and education communities that help advance drug development and other areas,” said Jonathan Fine, a PhD student in chemistry who worked with Chopra to develop the platform.
“We created Lemon as a one-stop-shop to quickly mine the entire data bank and pull out the useful biological information that is key for developing drugs.”
As machine learning becomes more integral to the healthcare industry, researchers have attempted to improve the accuracy and efficiency of these algorithms. Recently, a team at Penn Medicine discovered a once-hidden, through-line between two widely used predictive models that could increase the accuracy of machine learning tools. The discovery could expand the use of machine learning throughout healthcare and other industries.
“The expansion of machine learning to high-stakes application domains such as medicine, finance, and criminal justice, where making informed decisions requires clear understanding of the model, has increased the interest in interpretable machine learning,” researchers wrote in a study published in Proceedings in the National Journal of Sciences (PNAS).
“Despite recent efforts to formalize the concept of interpretability in machine learning, there is considerable disagreement on what such a concept means and how to measure it.”
Additionally, MIT researchers recently developed an automated system that can gather more data from images used to train machine learning models, including algorithms that can analyze brain scans to help treat and diagnose neurological conditions.
“We’re hoping this will make image segmentation more accessible in realistic situations where you don’t have a lot of training data,” said first author Amy Zhao, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and Computer Science and Artificial Intelligence Laboratory (CSAIL).
“In our approach, you can learn to mimic the variations in unlabeled scans to intelligently synthesize a large dataset to train your network.”