<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Databases Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/databases/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/databases/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Thu, 11 Mar 2021 06:51:54 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Yale study shows limitations of applying artificial intelligence to registry databases</title>
		<link>https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/</link>
					<comments>https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 11 Mar 2021 06:51:52 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Applying]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[limitations]]></category>
		<category><![CDATA[registry]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=13393</guid>

					<description><![CDATA[<p>Source &#8211; https://medicine.yale.edu/ Artificial intelligence will play a pivotal role in the future of health care, medical experts say, but so far, the industry has been unable <a class="read-more-link" href="https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/">Yale study shows limitations of applying artificial intelligence to registry databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source &#8211; https://medicine.yale.edu/</p>



<p>Artificial intelligence will play a pivotal role in the future of health care, medical experts say, but so far, the industry has been unable to fully leverage this tool. A Yale study has illuminated the limitations of these analytics when applied to traditional medical databases — suggesting that the key to unlocking their value may be in the way datasets are prepared.</p>



<p>Machine learning techniques are well suited to processing complex, high-dimensional data and to identifying nonlinear patterns, giving researchers and clinicians a framework for generating new insights. Achieving the potential of artificial intelligence will require improving the data quality of electronic health records (EHR).</p>



<p>“Our study found that advanced methods that have revolutionized predictions outside healthcare did not meaningfully improve prediction of mortality in a large national registry. These registries that rely on manually abstracted data within a restricted number of fields may, therefore, not be capturing many patient features that have implications for their outcomes,” said Rohan Khera, MD, MS, the first author of the new study published in JAMA Cardiology. “We believe that the next frontier for improving clinical prediction may be the application of these methods to the high-dimensional granular data collected in the EHR.”</p>



<p>The authors used the American College of Cardiology’s (ACC) Chest Pain-MI Registry from 2011 to 2016, which includes nearly 1 million patients hospitalized for an acute myocardial infarction (AMI), or heart attack, across more than 1,000 U.S. hospitals. The researchers applied three different machine learning models to predict death after hospitalization and observed only marginal gains over traditional logistic regression models applied to the same nationwide data.</p>
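


<p>To make that comparison concrete, here is a minimal sketch (not the study’s code) of this kind of head-to-head evaluation, using scikit-learn on synthetic tabular data; every name and number below is illustrative:</p>



<pre class="wp-block-code"><code># Illustrative only: compare traditional logistic regression with a
# gradient-boosted model on synthetic, imbalanced tabular data, mirroring
# the kind of evaluation the study describes. Not the study's actual code.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a registry: a restricted number of abstracted fields.
X, y = make_classification(n_samples=20000, n_features=40, n_informative=10,
                           weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("gradient boosting", GradientBoostingClassifier())]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, "AUC:", round(auc, 3))</code></pre>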



<p>Clinical registries such as the Chest Pain-MI Registry have been the mainstay for assessing patient outcomes across many hospitals through standardized data collection. These registries can advance clinical understanding and knowledge but are less suited to complex data collection and abstraction. Inferring additional insights will require rethinking how to aggregate the novel digital data streams being generated at most U.S. hospitals, the researchers said.</p>



<p>The study also underscores that while some methods are more efficient or transparent, the clinical value of machine learning will be determined by data collection and processing.</p>



<p>“The clinical adoption of machine learning will depend on whether it delivers better information – and that may importantly depend on the data that are used,” said Harlan Krumholz, MD, SM, director of the Center for Outcomes Research and Evaluation (CORE) at Yale and senior author of the study.</p>



<p>The research team included clinicians and scientists from across several Yale departments, including Robert McNamara, MD, MHS; Nihar Desai, MD, MPH; Chenxi Huang, PhD; and Bobak Jack Mortazavi, PhD.</p>



<p>The research was supported by the American College of Cardiology Foundation.</p>
<p>The post <a href="https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/">Yale study shows limitations of applying artificial intelligence to registry databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/yale-study-shows-limitations-of-applying-artificial-intelligence-to-registry-databases/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The CAP theorem, and how it applies to microservices</title>
		<link>https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/</link>
					<comments>https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 11 Dec 2020 05:12:26 +0000</pubDate>
				<category><![CDATA[Microservices]]></category>
		<category><![CDATA[application]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Developers]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Microservice]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=12411</guid>

					<description><![CDATA[<p>Source: searchapparchitecture.techtarget.com It&#8217;s not unusual for developers and architects who jump into microservices for the first time to &#8220;want it all&#8221; in terms of performance, uptime and <a class="read-more-link" href="https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/">The CAP theorem, and how it applies to microservices</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: searchapparchitecture.techtarget.com</p>



<p>It&#8217;s not unusual for developers and architects who jump into microservices for the first time to &#8220;want it all&#8221; in terms of performance, uptime and resiliency. After all, these are the goals that drive a software team&#8217;s decision to pursue this type of architecture design. The unfortunate truth is that trying to create an application that perfectly embodies all of these traits will eventually steer them to failure.</p>



<p>This phenomenon is summed up in something called the CAP theorem, which states that a distributed system can deliver only two of the three overarching goals of microservices design: consistency, availability and partition tolerance. According to CAP, not only is it impossible to &#8220;have it all&#8221; &#8212; you may even struggle to deliver more than one of these qualities at a time.</p>



<p>When it comes to microservices, the CAP theorem seems to pose an unsolvable problem: which of these three things can you afford to trade away? The essential point, however, is that you don&#8217;t have a choice about whether to make a tradeoff, only about which tradeoff to make. You&#8217;ll have to face that fact at the design stage, and you&#8217;ll need to think carefully about the type of application you&#8217;re building, as well as its most essential needs.</p>



<p>In this article, we&#8217;ll review the basics of how the CAP theorem applies to microservices, and then examine the concepts and guidelines you can follow when it&#8217;s time to make a decision.</p>



<h3 class="wp-block-heading">CAP theory and microservices</h3>



<p>Let&#8217;s start by reviewing the three qualities CAP specifically refers to:</p>



<ul class="wp-block-list"><li><strong>Consistency</strong> means that all clients see the same data at the same time, no matter the path of their request. This is critical for applications that do frequent updates.</li><li><strong>Availability</strong> means that all functioning application components will return a valid response, even if they are down. This is particularly important if an application&#8217;s user population has a low tolerance for outages (such as a retail portal).</li><li><strong>Partition</strong> <strong>tolerance</strong> means that the application will operate even during a network failure that results in lost or delayed messages between services. This comes into play for applications that integrate with a large number of distributed, independent components.</li></ul>



<p>Databases often sit at the center of the CAP problem. Microservices often rely on NoSQL databases, since they&#8217;re designed to scale horizontally and support distributed application processes. Partition tolerance is a &#8220;must have&#8221; in these types of systems because they are so sensitive to failure.</p>



<p>You can certainly design these kinds of databases for consistency and partition tolerance, or for availability and partition tolerance. But designing for both consistency and availability just isn&#8217;t an option.</p>



<h2 class="wp-block-heading">The PACELC theorem</h2>



<p>This prohibitive requirement for partition tolerance in distributed systems gave rise to what is known as the PACELC theorem, a sibling to the CAP theorem. The acronym PACELC stands for &#8220;if partitioned (P), then availability (A) or consistency (C); else (E), latency (L) or consistency (C).&#8221; In other words: If there is a partition, the distributed system must choose between availability and consistency; if not, the choice is between latency and consistency.</p>



<p>Designing your applications specifically to avoid partitioning problems in a distributed system will force you to sacrifice either availability or user experience to retain operational consistency. However, the key term here is &#8220;operational&#8221; &#8212; while latency is a primary concern during normal operations, a failure can quickly make availability the overall priority. So, why not create models for both scenarios?</p>



<p>It may help to frame CAP concepts in both &#8220;normal&#8221; and &#8220;fault&#8221; modes, given that faults in a distributed system are essentially inevitable. This enables you to create two database and microservices implementation models: one that handles normal operation, and another that kicks in during failures. For example, you can design your database to optimize consistency during a partition failure, and then continue to focus on mitigating latency during normal operation.</p>
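


<p>To make the latency-versus-consistency dial concrete, here is a minimal, hypothetical sketch of a quorum-replicated store in Python. Tuning the write quorum W and read quorum R against the replica count N is one common way distributed stores trade latency against consistency; none of the names below come from a real system:</p>



<pre class="wp-block-code"><code># Toy quorum-replicated store (illustrative only, not production code).
# With N replicas, choosing R and W such that R + W > N guarantees that a
# read quorum overlaps the most recent write quorum (consistency); smaller
# R and W lower latency but risk stale reads.
import random

class QuorumStore:
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.version = 0

    def write(self, key, value, w):
        self.version += 1
        for replica in random.sample(self.replicas, w):
            replica[key] = (self.version, value)

    def read(self, key, r):
        votes = [rep.get(key) for rep in random.sample(self.replicas, r)]
        votes = [v for v in votes if v is not None]
        return max(votes)[1] if votes else None  # newest version wins

store = QuorumStore(n_replicas=3)
store.write("x", "v1", w=3)  # slow but safe: every replica sees it
store.write("x", "v2", w=1)  # fast, low-latency write to one replica
print(store.read("x", r=1))  # R + W = 2, not above N: may return stale "v1"
print(store.read("x", r=3))  # R + W = 4, above N: guaranteed to return "v2"</code></pre>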



<h3 class="wp-block-heading">Applying PACELC to microservices</h3>



<p>If we use PACELC rather than &#8220;pure CAP&#8221; to define databases, we can classify them according to how they make the trades.</p>



<ul class="wp-block-list"><li>In PACELC terms, relational database management systems and NoSQL databases that implement ACID (atomicity, consistency, isolation, urability) are designed to assure consistency, classifying them as PC/EC. Typical business applications, like human resources apps and ticketing systems, will likely use this model, particularly if there are multiple users using different component instances. Google&#8217;s Bigtable database is a good example of this.</li><li>In-memory databases like MongoDB and Hazelcast fit into a PA/EC model, which is best suited for things like e-commerce apps, which need high availability even during network or component failures.</li><li>Real-time applications, such as IoT systems, fit into the PC/EL model that databases like PNUTS provide. This is the case in any application where consistency across replications is critical.</li><li>Database systems based on the PA/EL model, such as Dynamo and Cassandra, are best for real-time applications that don&#8217;t experience frequent updates, since consistency will be less of an issue.</li></ul>



<h3 class="wp-block-heading">Know the tradeoffs</h3>



<p>The bottom line is this: It&#8217;s critical to know exactly what you&#8217;re trading in a PACELC-guided application, and to know which scenarios call for which sacrifice. Here are three things to remember when making your decision:</p>



<ul class="wp-block-list"><li><strong>Consistency</strong>&nbsp;is most valuable where many users update the same data elements.</li><li><strong>Availability</strong>&nbsp;is critical for applications involving consumers (who get frustrated easily) and also for some IoT applications.</li><li><strong>Latency</strong>&nbsp;is most likely critical for real-time and&nbsp;<a href="https://internetofthingsagenda.techtarget.com/definition/Internet-of-Things-IoT">IoT</a>&nbsp;applications where processing delays must be kept to a minimum.</li></ul>



<p>Make your database choice wisely. Then, design your microservices workflows and framework to ensure you don&#8217;t compromise your goals.</p>
<p>The post <a href="https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/">The CAP theorem, and how it applies to microservices</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/the-cap-theorem-and-how-it-applies-to-microservices/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Bridging the gap between databases and data science</title>
		<link>https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/</link>
					<comments>https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Tue, 09 Jun 2020 06:44:46 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Analytical Tools]]></category>
		<category><![CDATA[application]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Developing]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=9394</guid>

					<description><![CDATA[<p>Source: universiteitleiden.nl Relational databases are used to store information or data in such a way that it preserves relations between the data. This property makes them a <a class="read-more-link" href="https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/">Bridging the gap between databases and data science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: universiteitleiden.nl</p>



<p>Relational databases store information, or data, in a way that preserves the relations between the data. This property makes them a useful tool for data scientists. There is, however, a gap between the relational database research community and data scientists, which leads to inefficient use of databases in data science. PhD student Mark Raasveldt has tried to bridge the gap between relational databases and data science. PhD defense 9 June 2020.</p>



<h4 class="wp-block-heading" id="integration-with-analytical-tools">Integration with analytical tools</h4>



<p>Most data scientists use analytical tools, such as R, Python and C/C++, for their research. These tools are difficult to integrate with current database systems, resulting in slow and cumbersome data analysis. ‘Data scientists have opted to reinvent database systems by developing a zoo of data management alternatives that perform similar tasks to classical database management systems, but have many of the problems that were solved in the database field decades ago,’ says Raasveldt.</p>



<p>‘The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing.’ Raasveldt tried to combine these innovations from database research with the analytical tools most used by data scientists. ‘We investigate how we can facilitate efficient and painless integration of analytical tools and relational database management systems,’ says Raasveldt.</p>



<h4 class="wp-block-heading" id="large-datasets">Large datasets</h4>



<p>Another issue with the use of standard database systems in data science is the size of the data being handled. Most database systems are not optimized for large datasets or for large-scale data analysis using remote servers. To optimize database systems for this use, three integration methods can be considered.</p>



<p>‘We focus our investigation on the three primary methods for database-client integration: client-server connections, in-database processing and embedding the database inside the client application,’ Raasveldt explains. For every method, he studied the implementations in existing database systems and he evaluated how efficient they are for the large datasets and workloads that are common in data science.</p>



<h4 class="wp-block-heading" id="duckdb">DuckDB</h4>



<p>Raasveldt’s final result is a new data management system, called DuckDB, purpose-built for efficient and painless integration with R and Python (and other analytical tools). DuckDB is intended to be a mature database system, not one used only for research purposes.</p>
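


<p>As a brief illustration of what that embedded model looks like in practice, here is a minimal sketch of querying a pandas DataFrame through DuckDB’s Python API; the data is made up, and API details may differ between DuckDB versions:</p>



<pre class="wp-block-code"><code># DuckDB runs inside the client process, so a pandas DataFrame can be
# queried with SQL directly, with no server and no data transfer across a
# client-server boundary. (pip install duckdb pandas)
import duckdb
import pandas as pd

df = pd.DataFrame({"species": ["duck", "duck", "goose"],
                   "weight_kg": [1.1, 1.3, 4.2]})

# The DataFrame is visible to SQL by its Python variable name, "df".
result = duckdb.query(
    "SELECT species, AVG(weight_kg) AS avg_weight FROM df GROUP BY species"
).df()
print(result)</code></pre>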



<p>‘In DuckDB we take all the lessons that we have learned investigating database-client integrations and create an easy-to-use and highly efficient embedded database.’ Raasveldt will continue his work as a postdoc at the CWI, where he will work on further developing DuckDB.</p>
<p>The post <a href="https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/">Bridging the gap between databases and data science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/bridging-the-gap-between-databases-and-data-science/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>NLM Leverages Data, Text Mining to Sharpen COVID-19 Research Databases</title>
		<link>https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/</link>
					<comments>https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Wed, 13 May 2020 13:25:39 +0000</pubDate>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Research]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=8745</guid>

					<description><![CDATA[<p>Source: governmentciomedia.com The National Library of Medicine is leveraging its database resources and artificial intelligence capabilities to rapidly provide COVID-19 literature and resources to researchers and scientists <a class="read-more-link" href="https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/">NLM Leverages Data, Text Mining to Sharpen COVID-19 Research Databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: governmentciomedia.com</p>



<p>The National Library of Medicine is leveraging its database resources and artificial intelligence capabilities to rapidly provide COVID-19 literature and resources to researchers and scientists as the world races to understand and respond to the pandemic.</p>



<p>The White House in March tapped NLM, under the National Institutes of Health, to join a public-private partnership&nbsp;called the COVID-19 Open Research Dataset (CORD-19) to develop data-mining techniques that could help the science community answer critical questions pertaining to COVID-19. Leveraging its existing infrastructure and establishing processes for content submission, NLM has quickly brought access to COVID-19 literature and clinical trial content on its PubMed Central (PMC) and ClinicalTrials.gov databases.</p>



<p>“As of May 1, about 46,000 articles had been deposited by publishers to PMC or updated in PMC to have a license that allows for text and data-mining, of which more than 5,600 articles specifically focus on the current novel coronavirus,” said NLM National Center for Biotechnology Information Acting Director Stephen Sherry. “Some 49 publishers are now included in the PMC COVID-19 initiative.”</p>



<p>In the first few weeks after launching the project, PMC saw significant COVID-19 download and data-sharing rates, said PMC Program Manager Kathryn Funk in an NIH webinar. As part of the project, Funk&#8217;s team worked to standardize submission data in a machine-readable format.</p>



<p>“The early results have been encouraging,” Funk said. “Articles in the Public Health Emergency Collection and PMC were retrieved more than 2 million times in the first two to three weeks of the initiative, and the coordinating dataset has been downloaded more than 75,000 times at this time. It’s our hope that through expanded access and machine learning, NLM will be able to help accelerate scientific research on COVID-19.”</p>



<p>NLM has also leveraged ClinicalTrials.gov’s existing infrastructure to scale up and provide quick access to information about trials related to COVID-19. Teams&nbsp;conducting trials around the world can submit standardized and structured information about their&nbsp;trials directly through an online submission portal called the Protocol Registration and Results system, where trial information is then posted to ClinicalTrials.gov within a couple of days of initial submission,&nbsp;Sherry said.</p>



<p>The data standardization and structure are critical to enabling AI technologies like machine learning and natural-language processing, which can help users more effectively mine and analyze the databases’ resources and literature to generate knowledge and support research that assists in responding to COVID-19, Sherry said.</p>



<p>“ClinicalTrials.gov also leverages NLM resources such as the biomedical vocabularies and standards integrated in the unified Medical Language System (UMLS) to support its search capabilities,” Sherry said, citing the database’s complete list of registered COVID-19 studies. “Users can filter the search results further by different study design characteristics, recruitment status, location information and other factors to identify trials of interest. All of these search capabilities are also available through the ClinicalTrials.gov API.”</p>
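


<p>For readers who want to try this, here is a hypothetical sketch of such a query in Python. The endpoint, parameter names and response keys below are assumptions modeled on the classic “study_fields” interface, which has since been revised; consult the current ClinicalTrials.gov API documentation:</p>



<pre class="wp-block-code"><code># Hypothetical sketch: search ClinicalTrials.gov for COVID-19 studies.
# Endpoint, parameters and response keys are assumptions based on the
# classic "study_fields" API and may not match the current service.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "expr": "COVID-19",                        # search expression
    "fields": "NCTId,BriefTitle,OverallStatus",
    "fmt": "json",
    "max_rnk": 5,                              # first five ranked results
})
url = "https://clinicaltrials.gov/api/query/study_fields?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for study in data["StudyFieldsResponse"]["StudyFields"]:
    # Each field arrives as a list; take the first element when present.
    print(study["NCTId"][0], "-", study["BriefTitle"][0])</code></pre>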



<p>Sherry likened the ClinicalTrials.gov infrastructure to an “information scaffold” for discovering information about clinical trials, as the platform applies unique identifiers called National Clinical Trial (NCT) numbers to each trial so that individuals can label and identify trials.</p>



<p>“As a result, different resources with information about particular trials can be linked and discovered through the use of unique NCT numbers, [such as] ClinicalTrials.gov records, press releases, journal articles, protocol document[s], informed consent forms, systematic reviews, reports, regulatory documents, individual participant-level data,” Sherry said.</p>



<p>Creating an open data repository ecosystem like ClinicalTrials.gov requires integrating different data contributors in a way that enables interoperability and usability of data, said NIH Director of Data Science Strategy Susan Gregurick, who helped establish the agency&#8217;s data science office in 2018.</p>



<p>“NIH strongly encourages open-access, data-sharing repositories as your first go-to choice when you’re looking for a repository to share your data and your information,” Gregurick said during an agency webinar last month.</p>



<p>Although NLM had already pledged in its strategic plan to modernize its databases, support data-driven science, collaborate with relevant stakeholders and build a future-ready workforce (including a multi-year effort to modernize ClinicalTrials.gov), COVID-19 has sparked a number of new data-backed initiatives and digital resources, said Sherry and Gregurick.</p>



<p>These are&nbsp;not just on PMC and ClinicalTrials.gov, but also on new platforms and resources, including:</p>



<ul class="wp-block-list"><li><strong>LitCovid</strong>, a COVID-19-specific open-resource literature hub that curates and disseminates a constantly growing comprehensive collection of international research papers relevant to public health. “This resource builds on NLM research to develop new approaches to locating and indexing the literature related to COVID-19, including a text classification algorithm for screening and ranking relevant documents, topic modeling for suggesting relevant research categories and information extraction for obtaining geographic locations found in the abstract,” Sherry said.</li><li>COVID-19 genetic sequence information additions to <strong>GenBank</strong>, the world’s largest genetic sequence database that released the first COVID-19 sequence to the public Jan. 12 and the first sequence collected in America in collaboration with the Centers for Disease Control and Prevention Jan. 25. “As of April 9, we have 579 SARS-CoV-2 sequences from 26 different countries publicly available,” Sherry said, adding that NLM has create a data hub on GenBank for individuals to search, retrieve and analyze COVID-19 sequences that have been submitted.</li><li>The <strong>Sequence Read Archive</strong>, an 14-petabyte archive of high-throughput genetic sequence data that as of February became available on commercial cloud-computing platforms, which Sherry said significantly expanded the discovery potential of the data to help identify mutational patterns and inform drug and vaccine development.</li><li><strong>PubChem</strong>, an open chemistry database that contains compounds used in COVID-19 clinical trials and found in COVID-19-related protein database structures.</li></ul>
<p>The post <a href="https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/">NLM Leverages Data, Text Mining to Sharpen COVID-19 Research Databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/nlm-leverages-data-text-mining-to-sharpen-covid-19-research-databases/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Object Stores Starting to Look Like Databases</title>
		<link>https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/</link>
					<comments>https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 17 Apr 2020 10:06:32 +0000</pubDate>
				<category><![CDATA[Google Cloud AutoML]]></category>
		<category><![CDATA[data analysts]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Google Cloud]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[software]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=8239</guid>

					<description><![CDATA[<p>Source: Don’t look now, but object stores – those vast repositories of data sitting behind an S3 API – are beginning to resemble databases. They’re obviously still <a class="read-more-link" href="https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/">Object Stores Starting to Look Like Databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: </p>



<p>Don’t look now, but object stores – those vast repositories of data sitting behind an S3 API – are beginning to resemble databases. They’re obviously still separate categories today, but as the next-generation data architecture takes shape to solve emerging real-time data processing and machine learning challenges, the lines separating things like object stores, databases, and streaming data frameworks will begin to blur.</p>



<p>Object stores have become the primary repository for the vast amounts of less-structured data that’s generated today. Organizations clearly are using object-based data lakes in the cloud and on premises to store unstructured data, like images and video. But they’re also using them to store many of the other types of data, like sensor and log data from mobile and IoT devices, that the world is generating.</p>



<p>The object store is becoming a general purpose data repository, and along the way it’s getting closer to the most popular data workloads, including SQL-based analytics and machine learning. The folks at object storage software vendor Cloudian are moving their wares in that direction too, according to Cloudian CTO Gary Ogasawara.</p>



<p>“We’re moving more and more to that,” Ogasawara tells Datanami. “If you can combine the best of both worlds – have the huge capacity of an object store and the advanced query capability of an SQL-type database – that would be the ideal. That’s what people are really asking for.”</p>



<h3 class="wp-block-heading">Past Is Prologue</h3>



<p>We’ve seen this film before. When Apache Hadoop was the hot storage repository for big data (really, less-structured data), one of the first big community efforts was to develop a relational database for it. That way, data analysts with existing SQL skills – as well as BI applications expecting SQL data – would be able to leverage it without extensive retraining. And besides, after running less-structured data through MapReduce jobs, you needed a place to put the structured data. A database is the logical place.</p>



<p>This led to the creation of Apache Hive out of Facebook, and the community followed with a host of other SQL-on-Hadoop engines (or relational databases, if you like), including Apache Impala, Presto, and Spark SQL, among others. Of course, Hadoop’s momentum fizzled over the past few years, in part due to the rise of S3 from Amazon Web Services and other cloud-based object storage systems, notably Azure Blob Storage from Microsoft and Google Cloud Storage, which are generally more user-friendly than Hadoop, if not always cheaper.</p>



<p>In the cloud, users are presented with a wide range of specialty storage repositories and processing engines for SQL and machine learning. On the SQL front, you have Amazon Redshift, Azure SQL Data Warehouse, and Google BigQuery. On top of these “native” offerings, the big data community has adapted many existing and popular analytics databases, including Teradata, Vertica, and others, to work with S3 and other object stores with an S3-compatible API.</p>



<p>The same goes for machine learning workloads. Once the data is in S3 (or Azure Blob Storage or Google Cloud Storage), it’s a relatively simple matter to use that data to build and train machine learning models in SageMaker, Azure Machine Learning, or Google Cloud AutoML. With the rise of the cloud, every member of the big data and machine learning community has moved to support the cloud, and with it object storage systems.</p>



<p>As the cloud’s momentum grows, S3 has become the de facto data access standard for the next generation of applications, from SQL analytics and machine learning to more traditional apps. For many new applications, data is simply expected to be stored in an object storage system, and developers expect to be able to access that data over the S3 API.</p>



<h3 class="wp-block-heading">A Hybrid Architecture</h3>



<p>But of course, not all new applications will live on the cloud with ready access to petabytes of data and gigaflops of computing power. In fact, with the rise of 5G networks and the explosion of smart devices on the Internet of Things (IoT), the physical world is the next frontier for computing, and that’s changing the dynamics for data architects who are trying to foresee new trends.</p>



<p>At Cloudian, Ogasawara and his team are working on adapting the company’s HyperStore object storage architecture to fit into the emerging edge-and-hub computing model. One of the examples he uses is the case of an autonomous car. With cameras, LIDAR, and other sensors, each self-driving car generates terabytes of data every day, and petabytes per year.</p>



<p>“That is all being generated at the edge,” he says. “Even with a 5G network, you will never be able to transmit all that data to somewhere else for analyses. You have to push that storage and processing as close to the edge as possible.”</p>



<p>Cloudian is currently working on developing a version of HyperStore that sits on the edge. In the self-driving car example, the local version of HyperStore would run right on the car and assist with storing and processing data coming off the sensors in real time. This computing environment would constitute a fast “inner loop,” Ogasawara says.</p>



<p>“But then you have a slower outer loop that’s also collecting data, and that includes the hub where the large, vast data lake resides in object storage,” he continues. “Here you can do more extensive training of ML models, for example, and then push that kind of metadata out to the edge, where it’s essentially a compiled version of your model that can be used very quickly.”</p>



<p>In the old days, object stores resembled relatively simple (and nearly infinitely scalable) key-value stores. But to support future use cases — like self-driving cars as well as weather modeling and genomics — the object store needs to learn new tricks, like how to stream data in and intelligently filter it so that only a subset of the most important data is forwarded from the edge to the hub.</p>



<p>To that end, Cloudian is working on a new project that will incorporate analytics capabilities. Under the working name of the Hyperstore Analytics Platform, the project would incorporate frameworks like Spark or TensorFlow to assist with the intelligent streaming and processing of data. A beta was expected by the end of the year (at least, that was the timeline Ogasawara shared in early March, before the COVID-19 lockdown).</p>



<h3 class="wp-block-heading">Object’s Evolution</h3>



<p>Cloudian is not the only object storage vendor looking at how to evolve its product to adapt to emerging data challenges. In fact, it’s not just object storage vendors who are trying to tackle the problem.</p>



<p>The folks at Confluent have adapted their Kafka-based stream processing technologies (which excel at processing event data) to work more like a database, which is good at managing stateful data. MinIO has SQL extensions that allow its object store to function like a database. NewSQL database vendor MemSQL has long had hooks for Kafka that allow it to process large amounts of real-time data. The in-memory data grid (IMDG) vendors are doing similar things for processing new event data within the context of historic, stateful data. And let’s not even get into how the event meshes are solving this problem.</p>
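


<p>The clearest example of an object store answering SQL directly is the S3 Select API, which MinIO also implements: a client pushes a SQL filter down to the store instead of downloading the whole object. Here is a minimal sketch using boto3; the bucket name, key and column names are hypothetical:</p>



<pre class="wp-block-code"><code># Sketch of S3 Select with boto3: the object store itself evaluates a SQL
# expression and streams back only matching rows. Bucket, key and column
# names are hypothetical.
import boto3

s3 = boto3.client("s3")
response = s3.select_object_content(
    Bucket="sensor-archive",                 # hypothetical bucket
    Key="readings/2020-04.csv",              # hypothetical CSV object
    ExpressionType="SQL",
    Expression="SELECT s.device_id, s.temp FROM s3object s WHERE s.temp > 90",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; Records events carry row data.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")</code></pre>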



<p>According to Ogasawara, adapting Cloudian’s HyperStore offering is a logical way to tackle today’s emerging data challenges. “You’ve done very well at building this storage infrastructure,” he says. “Now, how do you make the data usable and consumable? It’s really about providing better access APIs to get to that data, and almost making the object storage more intelligent.”</p>



<p>Object stores are moving beyond their initial use case, which was reading, writing, and deleting data at massive scale. Now customers are pushing object storage vendors to support more advanced workflows, including complex machine learning workflows. That will most likely require an extension to the S3 API (something that Cloudian has brought up with AWS, but without much success).</p>



<p>“How do you look into those objects? Those types of APIs need more and more [capabilities],” Ogasawara says. “And even letting AI or machine learning-type workflows, doing things like a sequence of operations — those types of language constructs, everyone is starting to look at and trying to figure out how do we make it easier for users and customers to make that data analysis possible.”</p>
<p>The post <a href="https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/">Object Stores Starting to Look Like Databases</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/object-stores-starting-to-look-like-databases/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>5 core components of microservices architecture</title>
		<link>https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/</link>
					<comments>https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Sat, 11 Apr 2020 10:49:37 +0000</pubDate>
				<category><![CDATA[Microservices]]></category>
		<category><![CDATA[applications]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Developers]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=8122</guid>

					<description><![CDATA[<p>Source: A microservices architecture &#8212; as the name implies &#8212; is a complex coalition of code, databases, application functions and programming logic spread across servers and platforms. <a class="read-more-link" href="https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/">5 core components of microservices architecture</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: </p>



<p> A microservices architecture &#8212; as the name implies &#8212; is a complex coalition of code, databases, application functions and programming logic spread across servers and platforms. Certain fundamental components of a microservices architecture bring all these entities together cohesively across a distributed system. </p>



<p>In this article, we review five key components of microservices architecture that developers and application architects need to understand if they plan to take the distributed service route. Start with microservices themselves, then learn about service mesh as an additional layer, app management via service discovery, container-based deployment and API gateways.</p>



<h3 class="wp-block-heading">1. Microservices</h3>



<p>Microservices make up the foundation of a microservices architecture. The term illustrates the method of breaking down an application into generally small, self-contained services, written in any language, that communicate over lightweight protocols. With independent microservices, software teams can implement iterative development processes, as well as create and upgrade features flexibly.</p>



<p>Teams need to decide the proper size for microservices, keeping in mind that an overly granular collection of too-segmented services creates high overhead and management needs. Developers should thoroughly decouple services in order to minimize dependencies between them and promote service autonomy, and they should use lightweight communication mechanisms such as REST over HTTP.</p>



<h3 class="wp-block-heading">2. Containers</h3>



<p>Containers are units of software that package services and their dependencies, maintaining a consistent unit through development, test and production. Containers are not necessary for microservices deployment, nor are microservices needed to use containers. However, containers can potentially improve deployment time and app efficiency in a microservices architecture more so than other deployment techniques, such as VMs.</p>



<p>The major difference between containers and VMs is that containers can share an OS and middleware components, whereas each VM includes an entire OS for its use. By eliminating the need for each VM to provide an individual OS for each small service, organizations can run a larger collection of microservices on a single server.</p>



<p>The other advantage of containers is their ability to deploy on demand without negatively impacting application performance. Developers can also replace, move and replicate them with fairly minimal effort. The independence and consistency of containers are a critical part of scaling certain pieces of a microservices architecture &#8212; according to workloads &#8212; rather than the whole application. They also support the ability to redeploy microservices in the event of a failure.</p>



<p>Docker, which started as an open-source platform for container management, is one of the most recognizable providers in the container space. However, Docker&#8217;s success caused a large tooling ecosystem to evolve around it, spawning popular container orchestrators like Kubernetes.</p>



<h3 class="wp-block-heading">3. Service mesh</h3>



<p>In a microservices architecture, the service mesh creates a dynamic messaging layer to facilitate communication. It abstracts the communication layer, which means developers don&#8217;t have to code in inter-process communication when they create the application.</p>



<p>Service mesh tooling typically uses a sidecar pattern, which creates a proxy container that sits beside the containers that have either a single microservice instance or a collection of services. The sidecar routes traffic to and from the container, and directs communication with other sidecar proxies to maintain service connections.</p>



<p>Two of today&#8217;s most popular service mesh options are Istio, a project that Google launched alongside IBM and Lyft, and Linkerd, a project under the Cloud Native Computing Foundation. Both Istio and Linkerd are tied to Kubernetes, though they feature notable differences in areas such as support for non-container environments and traffic control capabilities.</p>



<h3 class="wp-block-heading">4. Service discovery</h3>



<p>Whether it&#8217;s due to changing workloads, updates or failure mitigation, the number of microservice instances active in a deployment fluctuates. It can be difficult to keep track of large numbers of services that reside in distributed network locations throughout the application architecture.</p>



<p>Service discovery helps service instances adapt in a changing deployment, and distribute load between the microservices accordingly. The service discovery component is made up of three parts:</p>



<ul class="wp-block-list"><li>A&nbsp;<strong>service provider</strong>&nbsp;that originates service instances over a network;</li><li>A&nbsp;<strong>service registry</strong>, which acts as a database that stores the location of available service instances; and</li><li>A&nbsp;<strong>service consumer</strong>, which retrieves the location of a service instance from the registry, and then communicates with that instance.</li></ul>



<p>Service discovery also consists of two major discovery patterns:</p>



<ul class="wp-block-list"><li>A&nbsp;<strong>client-side discovery pattern</strong>&nbsp;searches the service registry to locate a service provider, selects an appropriate and available service instance using a load balancing algorithm, and then makes a request.</li><li>In a&nbsp;<strong>server-side discovery pattern</strong>, the router searches the service registry and, once the applicable service instance is found, forwards the request accordingly.</li></ul>



<p>Data residing in the service registry should always be current, so that related services can find their related service instances at runtime. If the service registry is down, it will hinder all the services, so enterprises typically use a distributed database, such as Apache ZooKeeper, to avoid regular failures.</p>



<h3 class="wp-block-heading">5. API gateway</h3>



<p>Another important component of a microservices architecture is an API gateway. API gateways are vital for communication in a distributed architecture, as they can create the main layer of abstraction between microservices and the outside clients. The API gateway will handle a large amount of the communication and administrative roles that typically occur within a monolithic application, allowing the microservices to remain lightweight. They can also authenticate, cache and manage requests, as well as monitor messaging and perform load balancing as necessary.</p>



<p>Additionally, an API gateway can speed up communication between microservices and clients by standardizing message protocol translation, freeing both the client and the service from the task of translating requests written in unfamiliar formats. Most API gateways also provide built-in security features, which means they can manage authorization and authentication for microservices, as well as track incoming and outgoing requests to identify any possible intrusions.</p>
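


<p>To show how small the routing core of that role can be, here is a minimal, hypothetical gateway sketch in Python using only the standard library; the routes, ports and header name are invented, and a real gateway would add caching, rate limiting and monitoring on top:</p>



<pre class="wp-block-code"><code># Illustrative API-gateway core: authenticate, route by path prefix, forward.
# Hostnames, ports and the API key are hypothetical.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ROUTES = {"/orders": "http://127.0.0.1:9001",   # path prefix -> backend
          "/users": "http://127.0.0.1:9002"}

class Gateway(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("X-Api-Key") != "demo-key":   # toy auth check
            self.send_error(401, "missing or invalid API key")
            return
        backend = next((b for prefix, b in ROUTES.items()
                        if self.path.startswith(prefix)), None)
        if backend is None:
            self.send_error(404, "no route for this path")
            return
        with urllib.request.urlopen(backend + self.path) as resp:
            body = resp.read()                # forward the backend response
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Gateway).serve_forever()</code></pre>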



<p>There is a wide array of API gateway options on the market to choose from, both from proprietary cloud platform providers like Amazon and Microsoft and from open source providers such as Kong and Tyk.</p>
<p>The post <a href="https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/">5 core components of microservices architecture</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/5-core-components-of-microservices-architecture/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Why Big Data Is a Big Opportunity for Employment</title>
		<link>https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/</link>
					<comments>https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/#comments</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Mon, 13 May 2019 06:03:21 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[Business]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[DSA]]></category>
		<category><![CDATA[Programming]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3487</guid>

					<description><![CDATA[<p>Source:- insidebigdata.com As data has transitioned from “nice to have” to a full-blown commodity, we’ve seen demand for data science and analytics (DSA) talent soar — along with <a class="read-more-link" href="https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/">Why Big Data Is a Big Opportunity for Employment</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:- insidebigdata.com</p>
<p>As data has transitioned from “nice to have” to a full-blown commodity, we’ve seen demand for data science and analytics (DSA) talent soar — along with opportunities for job-seekers and companies alike.</p>
<p>This is such a critical moment for data science, in fact, that industries across the board are bracing for talent shortages. Here’s why big data is such a massive opportunity — and one that needs the backing of educators at every level.</p>
<p><strong>The Data Science Skills Gap</strong></p>
<p>There are multiple data- and computer science-related skill areas which experts expect to remain in high demand throughout 2019 and beyond. These include:</p>
<ul>
<li>Programming languages</li>
<li>Data visualization</li>
<li>AI and machine learning</li>
<li>Data mining</li>
<li>Quantitative analysis</li>
<li>Algorithms</li>
<li>Databases and data structures</li>
</ul>
<p>There are more, too. But what about job numbers?</p>
<p>One 2019 report from Indeed indicated a 344% rise in the demand for data science specialists since 2013 and a 29% year-over-year increase. The U.S. Bureau of Labor Statistics expects the operations research analysis and computer research information science fields to grow by 27% and 19%, respectively, by 2026, which they deem “much faster than average” growth.</p>
<p>Other job websites corroborate these numbers and note that it isn’t just tech companies looking for data scientists — job listings span the gamut of industries. With good reason, too. Data is the new oil, as they say — and it powers every corner of commerce and human enterprise.</p>
<p><strong>Big Data Is Big Business (And Big Opportunity)</strong></p>
<p>For job-seekers everywhere, the implications are clear — if you have a head for numbers and an eye for detail, then big data and data science could be extremely lucrative and stable career tracks.</p>
<p>The fact is, big data has become a competitive advantage among companies in virtually every industry, from health care to logistics to supply chain management to merchandising analysis to marketing campaigns.</p>
<p>Data fuels companies large and small and keeps them fed with historical and real-time insights upon which to build business decisions, launch newer and more personalized products, engage in expansions and acquisitions, fine-tune outreach and marketing, and much more.</p>
<p>But the problem, according to Michael Chui of the McKinsey Global Institute, is that “it’s super hard to find the right talent.” Moreover, this lack of talent results in companies not fully understanding the opportunities that big data represents in the first place, such as clearer customer insights and leaner operations. In other words, it’s a catch-22 — we need more big data expertise to prove the value of big data expertise.</p>
<p>Nevertheless, the word is getting out. Data science is “sexy” now, and smaller and mid-sized companies are adding their own slate of job listings alongside big names like Google, Microsoft and Apple. And the arms race for big data talent has placed such professionals in an incredible position to market themselves and find a company and career path that truly suits them.</p>
<p>So the question becomes — What’s the best way to close this talent gap and connect employers with qualified and motivated candidates?</p>
<p><strong>How to Cultivate Big Data Talent</strong></p>
<p>In a report called “The Quant Crunch,” IBM lays out the scope of the problem and provides several ways industry and educational entities can marshal their forces to address the impending shortage of qualified DSA job applicants.</p>
<p>First up is the challenge of communicating to K-12 students, college students and those looking for new opportunities. Since data science is everywhere, these jobs are everywhere too. IBM identified the following industries as those with the highest demand for DSA talent. The percentages express the proportion of company job openings which fall into the DSA category:</p>
<ul>
<li>Finance and Insurance: 19%</li>
<li>Professional, Scientific, and Technical Services: 18%</li>
<li>Information services: 17%</li>
<li>Management of Companies and Enterprises: 13%</li>
<li>Manufacturing: 12%</li>
<li>Utilities: 10%</li>
<li>Wholesale Trade: 9%</li>
<li>Mining, Quarrying, and Oil and Gas Extraction: 9%</li>
<li>Public Administration: 7%</li>
</ul>
<p>From insurance to scientific research to retail to public service, there isn’t a single industry that isn’t being disrupted by this shift in the job market. Moreover, notes IBM, the human race generates 2.5 quintillion bytes of data daily. Netflix is ostensibly an entertainment company, but every decision they make relies on the apparent mountain of user data they sit on. Uber isn’t a transportation company — they are, in the words of their leaders, a technology company.</p>
<p>We’re looking at a whole new kind of economy here. So how can we prepare people for it? IBM has some ideas on that front:</p>
<ul>
<li><strong>Start Data Literacy Education Early</strong>: Organizations realize the benefits of adding big data talent to their value chain. But the general public might be lagging behind. IBM recommends adding “baseline data literacy” to our educational institutions as early as junior high. Doing so would prepare graduates with a firm understanding of the basics and ready them for additional, more finely-tuned training in an industry that appeals to them.</li>
<li><strong>More Focus on DSA Skills in Higher Learning</strong>: Higher learning institutions must, in turn, add new courses and degree tracks to their offerings — many are already doing so. IBM observes that 42% of open data scientist positions require candidates with a graduate degree and 20% of positions require at least six years of hands-on experience. We need more colleges with classroom experiences and internship opportunities specifically tailored to big data.</li>
<li><strong>Lower Hiring Requirements</strong>: Recognizing the value of driven and self-motivated problem-solvers, some major employers, like Microsoft and Hasbro, are lowering their experience and education requirements for job applicants, or splitting existing roles which had such requirements into multiple entry-level positions which do not.</li>
</ul>
<p>Of course, even these recommendations aren’t enough — not when technology skills in general, and DSA skills specifically, change so regularly. eLearning has long been seen by many as one of the remedies for a rapidly changing economy and the emergence of brand-new disciplines. We already see an explosion in eLearning options for students, even in K-12 schools. This could help school districts more finely tune their course offerings, cast a wider net for high-quality educators and experiences, and stay more agile as the needs of the job market change further.</p>
<p>The importance of big data expertise couldn’t be more clear — for companies or for job-seekers. Thankfully, a more complete picture is emerging for how we can prepare many more people for a role in this exciting and quickly growing field.</p>
<p>The post <a href="https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/">Why Big Data Is a Big Opportunity for Employment</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/why-big-data-is-a-big-opportunity-for-employment/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
	</channel>
</rss>
