<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>datasets Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/datasets/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/datasets/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Tue, 17 Mar 2020 10:08:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset</title>
		<link>https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/</link>
					<comments>https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Tue, 17 Mar 2020 10:08:22 +0000</pubDate>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[datasets]]></category>
		<category><![CDATA[Machine Readable]]></category>
		<category><![CDATA[researchers]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=7499</guid>

					<description><![CDATA[<p>Source: whitehouse.gov Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and <a class="read-more-link" href="https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/">Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: whitehouse.gov</p>



<p>Today, researchers and leaders from the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health released the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group.</p>



<p>Requested by The White House Office of Science and Technology Policy, the dataset represents the most extensive machine-readable Coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text.</p>



<p>Now, The White House joins these institutions in issuing a call to action to the Nation’s artificial intelligence experts to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.</p>



<p>The collection was constructed via a unique collaboration between Microsoft, NLM, CZI, and the Allen Institute for AI, coordinated by Georgetown University. Microsoft’s web-scale literature curation tools were used to identify and bring together worldwide scientific efforts and results, CZI provided access to pre-publication content, NLM provided access to literature content, and the Allen AI team transformed the content into machine-readable form, making the corpus ready for analysis and study.</p>



<p>The CORD-19 resource is available on the Allen Institute&#8217;s SemanticScholar.org website and will continue to be updated as new research is published in archival services and peer-reviewed publications. Researchers should submit the text and data mining tools and insights they develop in response to this call to action via the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge">Kaggle</a> platform. Through Kaggle, a machine learning and data science community owned by Google Cloud, these tools will be openly available for researchers around the world.</p>



<p>To inform the call to action, key scientific questions related to COVID-19 were developed in coordination with the National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21<sup>st</sup>&nbsp;Century Health Threats and the World Health Organization. The call to action and key questions are both available on Kaggle.</p>



<p>“Decisive action from America’s science and technology enterprise is critical to prevent, detect, treat, and develop solutions to COVID-19. The White House will continue to be a strong partner in this all-hands-on-deck approach. We thank each institution for voluntarily lending its expertise and innovation to this collaborative effort, and call on the United States research community to put artificial intelligence technologies to work in answering key scientific questions about the novel Coronavirus,” said Michael Kratsios, U.S. Chief Technology Officer, The White House.</p>



<p>“This valuable new resource is the fruit of unselfish collaboration and now offers the opportunity to find answers to important questions about COVID-19,” said Dr. Dewey Murdick, Director of Data Science at Georgetown University’s Center for Security and Emerging Technology (CSET), who coordinated the cross-team effort. “Once the crisis has passed, we hope this project will inspire new ways to use machine learning to advance scientific research.”</p>



<p>“It’s all hands on deck as we face the COVID-19 pandemic,” said Dr. Eric Horvitz, Chief Scientific Officer at Microsoft. “We need to come together as companies, governments, and scientists and work to bring our best technologies to bear across biomedicine, epidemiology, AI, and other sciences. The COVID-19 literature resource and challenge will stimulate efforts that can accelerate the path to solutions on COVID-19.”</p>



<p>“Sharing vital information across scientific and medical communities is key to accelerating our ability to respond to the coronavirus pandemic,” said Dr. Cori Bargmann, Head of Science at the Chan Zuckerberg Initiative. “The new COVID-19 Open Research Dataset will help researchers worldwide to access important information faster.”</p>



<p>“We are excited to be part of this collaboration to aid in the COVID-19 response and that the group is making use of our open access subset on coronavirus literature,” said Dr. Patricia Flatley Brennan, Director of the National Library of Medicine at the National Institutes of Health. “Our current collection of more than 10,000 full-text scholarly articles related to coronavirus provides a critical resource for text mining efforts like this one.”</p>



<p>“One of the most immediate and impactful applications of AI is in the ability to help scientists, academics, and technologists find the right information in a sea of scientific papers to move research faster. We applaud the OSTP, WHO, NIH and all organizations that are taking a proactive approach to use the most advanced technology in the fight against COVID-19,” said Dr. Oren Etzioni, Chief Executive Officer of the Allen Institute for AI. “The Allen Institute for AI, and particularly the Semantic Scholar team, is committed to updating and improving this important resource and the associated AI methods the community will be using to tackle this crucial problem.”</p>



<p>“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.</p>
<p>The post <a href="https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/">Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/call-to-action-to-the-tech-community-on-new-machine-readable-covid-19-dataset/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Decade of Data Science</title>
		<link>https://www.aiuniverse.xyz/the-decade-of-data-science/</link>
					<comments>https://www.aiuniverse.xyz/the-decade-of-data-science/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 02 Jan 2020 07:14:26 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[datasets]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[GitHub]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Microsoft]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=5913</guid>

					<description><![CDATA[<p>Source: towardsdatascience.com It’s been a hell of a decade for data science — Watson dominated Jeopardy (2011), the White House announced the first Chief Data Scientist (2015), <a class="read-more-link" href="https://www.aiuniverse.xyz/the-decade-of-data-science/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/the-decade-of-data-science/">The Decade of Data Science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: towardsdatascience.com</p>



<p>It’s been a hell of a decade for data science — Watson dominated Jeopardy (2011), the White House announced the first Chief Data Scientist (2015), and Deepfakes provided all the Nicolas Cage films we ever wanted (2017). Even our discipline’s name, data science, came out of the last 10 years. With 2019 coming to a close, there will inevitably be articles and lists with date-specific events like those above. However, there have also been several important trends that have developed slowly over the last 10 years. When I think of data science now compared to where it was in 2010, three items stand out to me.</p>



<h2 class="wp-block-heading" id="8476">1. A Language of Our Own</h2>



<p>Data scientists often come to the profession from a variety of backgrounds: mathematics, astronomy, or medicine. This variety lends perspective to our work, but one of its early drawbacks was a tower of Babel of programming languages. The early 2010s saw the beginning of a language war. Paid solutions like Stata, SPSS, SAS, Matlab, and Mathematica vied with open-source tools like R, C++, Python, Julia, Octave, and Gretl.</p>



<p>At first, R appeared like it might emerge victorious. Microsoft even made a play by acquiring Revolution Analytics (a company that provides an enhanced version of R). Despite this early lead, 2015 marked a shift towards Python. Its impressive library of packages, easy-to-learn syntax, and solid performance all played in its favor.</p>



<p>Fast forward to today. The war is over, new borders are drawn, and Python is the language of choice. This isn’t to say that all other languages are dead. R is still heavily used in statistics, and depending on your industry, you might be utilizing SAS or Matlab. However, 9 times out of 10, the answer is Python. The benefits of this shared language can’t be overstated. It makes it easier to read others’ work and code is more easily shared.</p>



<h2 class="wp-block-heading" id="4670">2. Jupyter Notebooks &amp; GitHub</h2>



<p>Search for a tutorial on any data science subject and whatever article you land on will almost certainly contain a link to a Jupyter Notebook hosted on GitHub. This practice is so common it’s essentially an unspoken standard. And rightfully so: it’s a powerful combo. Jupyter Notebooks combine your documentation and code into a single file that makes understanding the <em>what</em> and <em>why</em> much easier. Couple this with GitHub and you not only have a method for explaining your work but also a way of sharing it. GitHub has even taken steps to embrace Jupyter Notebooks by allowing them to be rendered right in the browser.</p>



<p>Where Python gives us a common language, Jupyter and GitHub have given us a common grammar and etiquette of sharing.</p>



<h2 class="wp-block-heading" id="ef50">3. Proliferation of Datasets</h2>



<p>10 years ago, data was much harder to come by. Even when you did find a dataset on your desired topic, chances were it was small. There was data out there, but it was unorganized, formats were inconsistent, and people were less likely to share what they had collected. Before I had the MNIST or Titanic datasets, I had the “Cars” dataset from Stata. It was a collection of vehicles from 1978 that included features like headroom and fuel efficiency and came with a whopping 74 observations.</p>



<p>Compare that to now, where you literally have companies posting their data for the world to see. In a way, Kaggle has become the Tinder of corporate data. Instead of being hesitant, companies are now enthusiastically sharing data, hoping to get a swipe right from an eager young quant. Local governments have also stepped up their game. Cities like Chicago and New York have made it possible for anyone to get access to data on topics ranging from road conditions to rodent sightings. Resources for searching datasets have also improved, with tools like the r/datasets subreddit and Google Dataset Search. If data is the new oil, then every data scientist today probably feels a little like Daniel Day-Lewis in <em>There Will Be Blood</em>.</p>



<h2 class="wp-block-heading" id="2085">The Common Thread: Open Science</h2>



<p>There’s a through line in the above topics: each represents a move towards open science, the philosophy that tools, methods, and data should be made openly available. The adoption of this ideology has, I think, been one of the most profitable investments our community has made in the last decade, aiding our growth, expanding our capabilities, and strengthening the quality of our work.</p>



<p>In terms of growth, the gates to data science are now open to all, in contrast to less than a decade ago, when entry required a hefty toll. Before Python, SAS was the closest thing to a standard language, and before Jupyter Notebooks there were Wolfram Notebooks. But these solutions weren’t open and free. Depending on the service, you would have been paying between $800 and $8,000 just to run a regression with an “industry standard” tool. Even their websites don’t make it easy for beginners. Visit any of them and you’re presented with dozens of different versions and links asking if you’d like to “get a quote,” making the experience feel less like diving into a dynamic discipline and more like buying a used car.</p>



<p>Commercial software for data science has waned over the last decade. I won’t claim to know the exact reason, but I’ll give my perspective: nothing extinguishes curiosity like a price tag. When I was nearing the end of my undergrad, I remember thinking, “I like doing numerical analysis, but there’s no way I can afford Stata. I guess I’d better learn this R thing.” I wasn’t alone; there has been a massive influx of amateurs who are curious and eager to test the waters of data science.</p>



<p>In addition to easy entry, open science has expanded the depth and breadth of everyone’s capabilities. A single data scientist isn’t necessarily a single resource. There’s a massive library of code from contributors around the world, whether in the form of Python packages, Medium tutorials, or Stack Overflow answers. This code corpus makes projects feel less like building something from scratch and more like mixing and matching Lego sets.</p>



<p>This openness also helps keep our work honest. Along with data science, this decade also brought us the phrase “replication crisis.” While there’s no cure-all for fraudulent work, transparency is a potent preventative measure. One of my favorite articles of the decade, <em>The Scientific Paper Is Obsolete</em>, has a great quote from Wolfram talking about notebooks: “there’s no bullshit there. It is what it is, it does what it does. You don’t get to fudge your data.”</p>



<h2 class="wp-block-heading" id="330f">Looking Forward, Looking Back</h2>



<p>In some ways this wasn’t just a good decade for data science; it was <em>the</em> decade of data science. We entered the zeitgeist of the 2010s. We were given the title of sexiest profession, we got a Brad Pitt and Jonah Hill movie, and millions saw AlphaGo defeat a world champion. In some ways, I feel like this decade was our coming of age. While the next decade will bring with it many improved tools and methodologies, the 2010s will be a time I look back on with a sense of fondness and nostalgia.</p>
<p>The post <a href="https://www.aiuniverse.xyz/the-decade-of-data-science/">The Decade of Data Science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/the-decade-of-data-science/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>NumPy 1.17.0 is here, officially drops Python 2.7 support pushing forward Python 3 adoption</title>
		<link>https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/</link>
					<comments>https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 01 Aug 2019 05:40:00 +0000</pubDate>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[datasets]]></category>
		<category><![CDATA[NumPy]]></category>
		<category><![CDATA[numpy.random]]></category>
		<category><![CDATA[transforms]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=4181</guid>

					<description><![CDATA[<p>Source: hub.packtpub.com Last week, the Python team released NumPy version 1.17.0. This version has many new features, improvements and changes to increase the performance of NumPy. The major highlight <a class="read-more-link" href="https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/">NumPy 1.17.0 is here, officially drops Python 2.7 support pushing forward Python 3 adoption</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: hub.packtpub.com</p>



<p>Last week, the Python team released NumPy version 1.17.0. This version has many new features, improvements and changes to increase the performance of NumPy.</p>



<p>The major highlights of this release include a new extensible numpy.random module, new radix sort and timsort sorting methods, and a pocketfft-based FFT implementation for more accurate transforms and better handling of datasets of prime length. Overriding of NumPy functions by NumPy-like libraries (the __array_function__ protocol) is also now enabled by default.</p>



<p>NumPy 1.17.0 supports Python versions 3.5 through 3.7. Python 3.8b2 will work with the release source packages, but support in future releases is not guaranteed. Support for Python 2.7 has been officially dropped.</p>



<h4 class="wp-block-heading">What’s new in NumPy 1.17.0?</h4>



<h4 class="wp-block-heading">New extensible numpy.random module with selectable random number generators</h4>



<p>NumPy 1.17.0 has a new extensible numpy.random module. It includes four selectable random number bit generators and improved seeding designed for use in parallel processes. <em>PCG64</em> is the new default bit generator, while <em>MT19937</em> is retained for backwards compatibility.</p>
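


<p>A minimal sketch of the new interface (assuming NumPy 1.17 or later; the seed values are arbitrary):</p>



<pre class="wp-block-code"><code>import numpy as np
from numpy.random import default_rng, Generator, MT19937

# default_rng() returns a Generator backed by PCG64,
# the new default bit generator
rng = default_rng(42)
print(rng.standard_normal(3))
print(rng.integers(low=0, high=10, size=5))

# A bit generator can also be selected explicitly,
# e.g. the legacy MT19937 kept for backwards compatibility
legacy = Generator(MT19937(42))
print(legacy.standard_normal(3))
</code></pre>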



<h4 class="wp-block-heading">Timsort and radix sort have replaced mergesort for stable sorting</h4>



<p>Both radix sort and timsort have been implemented and can be used instead of mergesort. For backward compatibility, the sorting kind options &#8216;stable&#8217; and &#8216;mergesort&#8217; have been made aliases of each other, with the actual sort implementation chosen by data type: radix sort is used for integer types of 16 bits or less, and timsort is used for the remaining types.</p>
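


<p>A short illustration of the aliasing; note the underlying algorithm is an internal detail chosen by dtype, not something the API exposes:</p>



<pre class="wp-block-code"><code>import numpy as np

a = np.array([3, 1, 2, 1], dtype=np.int16)

# 'stable' and 'mergesort' now select the same implementation:
# a radix sort for integer types of 16 bits or less,
# timsort for the remaining types
print(np.sort(a, kind='stable'))
print(np.sort(a.astype(np.float64), kind='mergesort'))
</code></pre>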



<h4 class="wp-block-heading">empty_like and related functions now accept a shape argument</h4>



<p>Functions like empty_like, full_like, ones_like and zeros_like now accept a shape keyword argument, which creates a new array using an existing array as the prototype while overriding its shape. These functions are especially useful when combined with the <em>__array_function__</em> protocol, as they allow NumPy-like libraries to create new arrays of arbitrary shape.</p>
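


<p>For example (a sketch assuming NumPy 1.17 or later):</p>



<pre class="wp-block-code"><code>import numpy as np

proto = np.ones((2, 3), dtype=np.float32)

# dtype and other properties come from the prototype,
# but the shape is overridden
z = np.zeros_like(proto, shape=(4, 4))
print(z.shape, z.dtype)   # (4, 4) float32

e = np.empty_like(proto, shape=(5,))
print(e.shape)            # (5,)
</code></pre>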



<h4 class="wp-block-heading">User-defined LAPACK detection order</h4>



<p><em>numpy.distutils</em> now uses a comma-separated, case-insensitive environment variable to determine the detection order for LAPACK libraries. This aims to help users with an MKL installation try different implementations.</p>
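


<p>A hedged sketch of how this might be used when building NumPy from source; the variable name NPY_LAPACK_ORDER follows the 1.17 release notes, and the pip invocation is only illustrative:</p>



<pre class="wp-block-code"><code>import os
import subprocess

# Assumed from the release notes: a comma-separated, case-insensitive
# preference list read by numpy.distutils during a source build
env = dict(os.environ, NPY_LAPACK_ORDER="MKL,openblas,lapack")

# Force a source build so the detection order actually matters
subprocess.run(
    ["pip", "install", "--no-binary", ":all:", "numpy==1.17.0"],
    env=env, check=True,
)
</code></pre>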



<h4 class="wp-block-heading">.npy files support unicode field names</h4>



<p>A new format version of .npy files has been introduced. It enables structured types with non-latin1 field names and is selected automatically when needed.</p>
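


<p>A small sketch; the field name is an arbitrary non-latin1 example, and the newer format version is chosen automatically on save:</p>



<pre class="wp-block-code"><code>import numpy as np

# A structured dtype with a non-latin1 field name triggers
# the new .npy format version automatically
arr = np.zeros(3, dtype=[('温度', np.float64), ('count', np.int32)])
np.save('readings.npy', arr)

loaded = np.load('readings.npy')
print(loaded.dtype.names)   # ('温度', 'count')
</code></pre>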



<h4 class="wp-block-heading">New mode “empty” for pad</h4>



<p>The new mode “empty” pads an array to a desired shape without initializing any new entries.</p>
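


<p>For instance:</p>



<pre class="wp-block-code"><code>import numpy as np

a = np.arange(6).reshape(2, 3)

# Grow the array by one entry on every side; the new border
# entries are left uninitialized rather than filled
padded = np.pad(a, pad_width=1, mode='empty')
print(padded.shape)   # (4, 5)
</code></pre>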



<h4 class="wp-block-heading">New Deprications in NumPy 1.17.0</h4>



<h4 class="wp-block-heading">numpy.polynomial functions warn when passed float in place of int</h4>



<p>Previously, functions in the numpy.polynomial module accepted float values where integers were expected. With NumPy 1.17.0, passing float values is deprecated for consistency with the rest of NumPy. In future releases, it will cause a TypeError.</p>
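


<p>A hedged illustration; exactly which arguments warn may vary by function, but the degree argument of a fit is a typical case:</p>



<pre class="wp-block-code"><code>import warnings
import numpy as np
from numpy.polynomial import polynomial as P

x = np.linspace(0.0, 1.0, 10)
y = x ** 2

# An integer degree is the supported spelling
coef = P.polyfit(x, y, 2)

# A float degree is deprecated and slated to become a TypeError
with warnings.catch_warnings():
    warnings.simplefilter('error', DeprecationWarning)
    try:
        P.polyfit(x, y, 2.0)
    except (DeprecationWarning, TypeError) as exc:
        print(type(exc).__name__, exc)
</code></pre>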



<h4 class="wp-block-heading">Deprecate numpy.distutils.exec_command and temp_file_name</h4>



<p>The internal use of these functions has been refactored in favor of better alternatives: replace <em>exec_command</em> with subprocess.Popen, and <em>temp_file_name</em> with tempfile.mkstemp.</p>
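


<p>The standard-library replacements look roughly like this:</p>



<pre class="wp-block-code"><code>import os
import subprocess
import tempfile

# Instead of numpy.distutils.exec_command:
result = subprocess.run(['echo', 'hello'],
                        stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout.strip())

# Instead of numpy.distutils.temp_file_name:
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
print(path)
</code></pre>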



<h4 class="wp-block-heading">Writeable flag of C-API wrapped arrays</h4>



<p>When an array is created from the C-API to wrap a pointer to data, the writeable flag set during creation indicates the read-write nature of the data. In future releases, it will not be possible to switch the writeable flag to True from Python, as this is considered dangerous.</p>
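


<p>The C-API case itself can’t be shown from pure Python, but a rough Python-level analogue of the restriction (an array wrapping memory NumPy does not own) looks like this:</p>



<pre class="wp-block-code"><code>import numpy as np

# frombuffer wraps the bytes object's memory without copying,
# so the resulting array is read-only
a = np.frombuffer(bytes(8), dtype=np.int64)
print(a.flags.writeable)   # False

# Flipping the flag back to True is the kind of switch being restricted
try:
    a.flags.writeable = True
except ValueError as exc:
    print(exc)
</code></pre>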



<h2 class="wp-block-heading">Other improvements and changes</h2>



<h4 class="wp-block-heading">Replacement of the fftpack based fft module by the pocketfft library</h4>



<p>The <em>pocketfft</em> library contains additional modifications compared to <em>fftpack</em> that improve accuracy and performance. If FFT lengths have large prime factors, <em>pocketfft</em> uses Bluestein’s algorithm, which maintains O(N log N) run-time complexity instead of deteriorating towards O(N*N) for prime lengths.</p>
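


<p>A quick check of the prime-length case (10007 is prime; with pocketfft the transform still completes in O(N log N) time):</p>



<pre class="wp-block-code"><code>import numpy as np

n = 10007   # prime length: pocketfft falls back to Bluestein's algorithm
x = np.random.rand(n)

X = np.fft.fft(x)
print(X.shape, X.dtype)   # (10007,) complex128
</code></pre>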



<h4 class="wp-block-heading">Array comparison assertions include maximum differences</h4>



<p>Error messages from array comparison tests such as testing.assert_allclose now include “max absolute difference” and “max relative difference” along with the previous “mismatch” percentage. This makes it easier to update absolute and relative error tolerances.</p>
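


<p>For example, a failing comparison now prints the new diagnostics:</p>



<pre class="wp-block-code"><code>import numpy as np

actual = np.array([1.0, 2.0, 3.001])
desired = np.array([1.0, 2.0, 3.0])

# The AssertionError message includes "Max absolute difference"
# and "Max relative difference" alongside the mismatch percentage
try:
    np.testing.assert_allclose(actual, desired, rtol=1e-7)
except AssertionError as exc:
    print(exc)
</code></pre>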



<h4 class="wp-block-heading">median and percentile family of functions no longer warn about nan</h4>



<p>Functions like numpy.median, numpy.percentile, and numpy.quantile used to emit a <em>RuntimeWarning</em> when encountering a nan. Since these functions return the nan value, the warning is redundant and has been removed.</p>
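


<p>For example:</p>



<pre class="wp-block-code"><code>import numpy as np

a = np.array([1.0, np.nan, 3.0])

# Returns nan without a RuntimeWarning; use the nan-aware
# variants to skip missing values instead
print(np.median(a))      # nan
print(np.nanmedian(a))   # 2.0
</code></pre>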



<h4 class="wp-block-heading">timedelta64 % 0 behavior adjusted to return NaT</h4>



<p>The modulus operation with two&nbsp;<em>np.timedelta64</em>&nbsp;operands now returns&nbsp;<em>NaT</em>&nbsp;in case of division by zero, rather than returning zero.</p>
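


<p>For example:</p>



<pre class="wp-block-code"><code>import numpy as np

num = np.timedelta64(7, 's')

# Modulus by a zero timedelta now yields NaT rather than 0
print(num % np.timedelta64(0, 's'))   # NaT
print(num % np.timedelta64(3, 's'))   # 1 seconds
</code></pre>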



<p>Though users are happy with NumPy 1.17.0 features, some are upset over the Python version 2.7 being officially dropped.</p>
<p>The post <a href="https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/">NumPy 1.17.0 is here, officially drops Python 2.7 support pushing forward Python 3 adoption</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/numpy-1-17-0-is-here-officially-drops-python-2-7-support-pushing-forward-python-3-adoption/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Big Blue opens up hub for machine learning datasets</title>
		<link>https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/</link>
					<comments>https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 18 Jul 2019 12:21:37 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Big Blue]]></category>
		<category><![CDATA[datasets]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Learning]]></category>
		<category><![CDATA[machine]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=4058</guid>

					<description><![CDATA[<p>Source: devclass.com IBM has launched a repository of datasets for training which data scientists can pick and mix to train their deep learning and machine learning models. <a class="read-more-link" href="https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/">Big Blue opens up hub for machine learning datasets</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: devclass.com</p>



<p>IBM has launched a repository of training datasets that data scientists can pick and mix to train their deep learning and machine learning models.</p>



<p>The IBM Data Asset eXchange (DAX) is designed to complement the Model Asset eXchange it launched earlier this year, which offers researchers and developers models to deploy or train with their own data.</p>



<p>In a blog post announcing the data exchange, a quartet of IBM luminaries wrote: “Developers adopting ML models need open data that they can use confidently under clearly defined open data licenses.”</p>



<p>The data sets in question will be covered by the Linux Foundation’s Community Data License Agreement (CDLA) open data licensing framework to enable data sharing and collaboration – “where possible”.</p>



<p>DAX will also provide “unique access to various IBM and IBM Research datasets.” Big Blue has pledged to publish further datasets, and said “The datasets on DAX will integrate with IBM Cloud and AI services as appropriate.”</p>



<p>There are other ways to source data and models, with IBM’s announcement referencing GitHub and Kaggle, while the PyTorch hub launched a model repository earlier this year.</p>



<p>IBM claimed DAX would be “unique in its high level of quality and curation”, as it would help developers build “end-to-end” deep learning workflows, and allow “developers to consume open data with confidence under clearly defined open data licenses.”</p>



<p>That might sound rather dull to developers used to skunkworks-like conditions, but as machine learning creeps across the enterprise, compliance and ethical practices become a bigger concern.</p>



<p>“The CODAIT team’s goal is to make it straightforward to use DAX and MAX assets in conjunction with IBM AI products as well as other hybrid, multicloud AI tooling,” the team said, which will presumably be a relief for those developers who don’t want to actually lock themselves into IBM’s way of machine learning.</p>



<p>As of today, there are eight datasets on the exchange, including IBM’s Contracts Proposition Bank, which features text from IBM’s contracts, the NOAA Weather Data set for JFK Airport, and a set containing 100 randomly sampled discussion threads from Ubuntu Forums.</p>
<p>The post <a href="https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/">Big Blue opens up hub for machine learning datasets</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/big-blue-opens-up-hub-for-machine-learning-datasets/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
