Big Blue opens up hub for machine learning datasets
IBM has launched a repository of datasets which data scientists can pick and mix to train their deep learning and machine learning models.
The IBM Data Asset eXchange (DAX) is designed to complement the Model Asset eXchange it launched earlier this year, which offers researchers and developers models to deploy or train with their own data.
In a blog post announcing the data exchange, a quartet of IBM luminaries wrote: “Developers adopting ML models need open data that they can use confidently under clearly defined open data licenses.”
The datasets in question will be covered by the Linux Foundation’s Community Data License Agreement (CDLA) open data licensing framework to enable data sharing and collaboration – “where possible”.
DAX will also provide “unique access to various IBM and IBM Research datasets.” Big Blue has pledged to publish further datasets, and said: “The datasets on DAX will integrate with IBM Cloud and AI services as appropriate.”
There are other ways to source data and models, with IBM’s announcement referencing GitHub and Kaggle, while PyTorch launched its own model repository, PyTorch Hub, earlier this year.
IBM claimed DAX would be “unique in its high level of quality and curation”, as it would help developers build “end-to-end” deep learning workflows, and allow “developers to consume open data with confidence under clearly defined open data licenses.”
That might sound rather dull to developers used to skunkworks-like conditions, but as machine learning creeps across the enterprise, compliance and ethical practices become a bigger concern.
“The CODAIT team’s goal is to make it straightforward to use DAX and MAX assets in conjunction with IBM AI products as well as other hybrid, multicloud AI tooling,” the team said. That will presumably be a relief for developers who don’t want to lock themselves into IBM’s way of doing machine learning.
As of today, there are eight datasets on the exchange, including IBM’s Contracts Proposition Bank, which features text from IBM’s contracts; the NOAA weather dataset for JFK Airport; and a set containing 100 randomly sampled discussion threads from Ubuntu Forums.