<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data quality Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/data-quality/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/data-quality/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Wed, 19 Jun 2024 12:15:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Overcoming the Challenges in Training Generative AI Models: A Comprehensive Guide</title>
		<link>https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/</link>
					<comments>https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/#respond</comments>
		
		<dc:creator><![CDATA[Maruti Kr.]]></dc:creator>
		<pubDate>Wed, 19 Jun 2024 12:14:59 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[AI model training]]></category>
		<category><![CDATA[AI scalability]]></category>
		<category><![CDATA[Computational resources]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Ethical AI]]></category>
		<category><![CDATA[Generative AI]]></category>
		<category><![CDATA[Hyperparameter tuning]]></category>
		<category><![CDATA[Model complexity]]></category>
		<category><![CDATA[Model interpretability]]></category>
		<category><![CDATA[Training instability]]></category>
		<guid isPermaLink="false">https://www.aiuniverse.xyz/?p=18924</guid>

					<description><![CDATA[<p>Training generative AI models presents a variety of challenges and limitations. Key among these are: Data Quality and Quantity, Computational Resources, Model Complexity, and Training Stability and Performance. <a class="read-more-link" href="https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/">Overcoming the Challenges in Training Generative AI Models: A Comprehensive Guide</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="512" src="https://www.aiuniverse.xyz/wp-content/uploads/2024/06/image-13-1024x512.png" alt="" class="wp-image-18925" srcset="https://www.aiuniverse.xyz/wp-content/uploads/2024/06/image-13-1024x512.png 1024w, https://www.aiuniverse.xyz/wp-content/uploads/2024/06/image-13-300x150.png 300w, https://www.aiuniverse.xyz/wp-content/uploads/2024/06/image-13-768x384.png 768w, https://www.aiuniverse.xyz/wp-content/uploads/2024/06/image-13.png 1100w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Training generative AI models presents a variety of challenges and limitations. Key among these are:</p>



<h3 class="wp-block-heading">Data Quality and Quantity</h3>



<ol class="wp-block-list">
<li><strong>Data Availability</strong>: Generative models often require vast amounts of data to learn effectively. Accessing large, diverse datasets can be challenging, particularly in specialized domains.</li>



<li><strong>Data Quality</strong>: High-quality, well-labeled data is crucial. Poor-quality data can lead to biased or inaccurate models. Ensuring data cleanliness, dealing with missing values, and addressing inconsistencies are significant hurdles.</li>



<li><strong>Data Privacy and Security</strong>: Many datasets contain sensitive information. Ensuring data privacy and security while maintaining data utility for training is a complex issue, especially with regulations like GDPR.</li>
</ol>



<h3 class="wp-block-heading">Computational Resources</h3>



<ol class="wp-block-list">
<li><strong>High Computational Requirements</strong>: Training state-of-the-art generative models, such as GPT or GANs, demands substantial computational power. This includes powerful GPUs or TPUs, large memory, and extensive storage capabilities.</li>



<li><strong>Energy Consumption</strong>: The computational resources required translate into high energy consumption, raising concerns about the environmental impact and the sustainability of large-scale AI models.</li>
</ol>



<h3 class="wp-block-heading">Model Complexity</h3>



<ol class="wp-block-list">
<li><strong>Architecture Design</strong>: Choosing the right model architecture is crucial and non-trivial. It involves selecting appropriate neural network structures, layers, and parameters, which requires deep expertise and experimentation.</li>



<li><strong>Hyperparameter Tuning</strong>: Optimizing hyperparameters (learning rate, batch size, etc.) is essential for model performance but is often a time-consuming and resource-intensive process.</li>
</ol>
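<p>The tuning loop described above can be sketched as a simple grid search over candidate settings. The scoring function, learning rates, and batch sizes below are illustrative placeholders, not values from the article:</p>

```python
# Minimal grid-search sketch for hyperparameter tuning.
# train_and_score() is a stand-in for a real training run; in practice
# it would train the model and return a validation metric.
from itertools import product

def train_and_score(lr, batch_size):
    # Mock score that peaks at lr=1e-3, batch_size=64 (purely illustrative).
    return -abs(lr - 1e-3) - abs(batch_size - 64) / 1000

best_score, best_cfg = float("-inf"), None
for lr, bs in product([1e-4, 1e-3, 1e-2], [32, 64, 128]):
    score = train_and_score(lr, bs)
    if score > best_score:
        best_score, best_cfg = score, (lr, bs)

print(best_cfg)  # the configuration with the highest validation score
```

In a real setting each evaluation is expensive, which is why more sample-efficient strategies such as random or Bayesian search are often preferred over exhaustive grids.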



<h3 class="wp-block-heading">Training Stability and Performance</h3>



<ol class="wp-block-list">
<li><strong>Training Instability</strong>: Generative models, especially GANs, can suffer from instability during training. Issues such as mode collapse, vanishing gradients, and non-convergence are common.</li>



<li><strong>Scalability</strong>: As models and datasets grow, ensuring scalability of the training process becomes challenging. Efficient parallelization and distributed training are necessary but complex to implement.</li>
</ol>



<h3 class="wp-block-heading">Interpretability and Evaluation</h3>



<ol class="wp-block-list">
<li><strong>Model Interpretability</strong>: Understanding and interpreting the inner workings of generative models is difficult, making it hard to diagnose and fix issues.</li>



<li><strong>Evaluation Metrics</strong>: Evaluating generative models is less straightforward compared to discriminative models. Metrics like Inception Score (IS) and Fréchet Inception Distance (FID) are used, but they have limitations and do not always correlate with human judgment.</li>
</ol>
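<p>As a rough illustration of the FID mentioned above: the metric compares the mean and covariance of two sets of feature vectors. The sketch below uses random arrays in place of real Inception-network features:</p>

```python
# Sketch of the Fréchet Inception Distance (FID) formula:
#   FID = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 * sqrtm(C_a @ C_b))
# Real FID uses features extracted by an Inception network; the random
# arrays here stand in purely to demonstrate the computation.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):      # numerical error can introduce
        covmean = covmean.real        # tiny imaginary components
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.5, size=(500, 8))
print(fid(real, real))  # identical sets -> FID near 0
print(fid(real, fake))  # shifted distribution -> positive FID
```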



<h3 class="wp-block-heading">Ethical and Social Implications</h3>



<ol class="wp-block-list">
<li><strong>Bias and Fairness</strong>: Generative models can inadvertently learn and propagate biases present in training data, leading to unfair or unethical outcomes.</li>



<li><strong>Misuse Potential</strong>: Generative models can be used to create misleading or harmful content (e.g., deepfakes), raising ethical concerns and necessitating robust safeguards.</li>
</ol>



<h3 class="wp-block-heading">Development and Maintenance Costs</h3>



<ol class="wp-block-list">
<li><strong>Resource Investment</strong>: Developing state-of-the-art generative models requires significant financial investment in terms of hardware, software, and human expertise.</li>



<li><strong>Continuous Updates</strong>: Maintaining and updating models to improve performance, address biases, and incorporate new data is an ongoing challenge.</li>
</ol>



<p>Addressing these challenges requires a multidisciplinary approach, combining advances in machine learning, data engineering, computational infrastructure, and ethical frameworks.</p>
<p>The post <a href="https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/">Overcoming the Challenges in Training Generative AI Models: A Comprehensive Guide</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/overcoming-the-challenges-in-training-generative-ai-models-a-comprehensive-guide/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Getting Serious About Data and Data Science</title>
		<link>https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/</link>
					<comments>https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Tue, 29 Sep 2020 07:36:12 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[Change Management]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=11844</guid>

					<description><![CDATA[<p>Source: sloanreview.mit.edu Data science, including analytics, big data, and artificial intelligence, is no longer a novel concept. Nor is the important foundation of high-quality data. Both have <a class="read-more-link" href="https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/">Getting Serious About Data and Data Science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: sloanreview.mit.edu</p>



<p>Data science, including analytics, big data, and artificial intelligence, is no longer a novel concept. Nor is the important foundation of high-quality data. Both have contributed to impressive business successes — particularly among digital natives — yet overall progress among established companies has been painfully slow. Not only is the failure rate high, but companies have also proved unable to leverage successes in one part of the business to reap benefits in other areas. Too often, progress depends on a single leader, and it slows dramatically or reverses when that individual departs the company. In addition, companies are not seizing the strategic potential in their data. We’d estimate that less than 5% of companies use their data and data science to gain an effective competitive edge.</p>



<p>Over the years, we have worked with dozens of companies on their data journeys, advising them on the approaches, techniques, and organizational changes needed to succeed with data, including quality, data science, and AI. From our perspective, these are the two biggest mistakes organizations make:</p>

<ol class="wp-block-list"><li>They underinvest in the organization (people, structure, or culture), process, and the strategic transformations needed to get on offense — in other words, to take full advantage of their data and the data analytics technologies at their disposal.</li><li>They address data quality improperly, which leads them to waste critical resources (time and money) dealing with mundane issues. Bad data, in turn, breeds mistrust in the data, further slowing efforts to create advantage.</li></ol>



<p>Although the details at each company differ, seeing data too narrowly — as the province of IT or the data science organization, not of the entire business — is a recurring theme. This causes companies to overlook the transformative potential in data and therefore underinvest in the organizational, process, and strategic changes cited above. Similarly, they blame technology for their quality woes and failures to capitalize on data, when the real problem is poor management.</p>



<p>We’ve all observed how companies behave when they are truly serious about something — how the goal changes from incremental progress to rapid transformation; how they muster both breadth and depth of resources; how they align and train people; how they communicate new values and new ways of working; and how senior leaders drive the effort. Indeed, it almost seems as if companies go overboard when they are truly serious about something. Amazon’s Project D initiative to develop the Echo/Alexa smart speaker is a great illustration of that seriousness, with hundreds of employees, several startup acquisitions, heavy CEO involvement, and no expense spared. DBS Bank’s journey to being named World’s Best Digital Bank by Euromoney is another good example. The company’s CEO, Piyush Gupta, said the following upon receiving that award in 2018:</p>



<p>At DBS, we believe that banks tomorrow will look fundamentally different from banks today. That’s why we have spent the past three years deeply immersed in the digital agenda. This has been an all-encompassing journey, whether it is changing the culture and mindsets of our people, re-architecting our technology infrastructure, or leveraging big data, biometrics, and AI to make banking simple and seamless for customers.</p>



<p>The contrast with most companies’ data programs is stark — one can only conclude that many are not yet serious about data and data science. For those only beginning to explore data, this may be understandable. But, if you’ve been at it for three years or more, it is time to either get serious in addressing mistakes or invest your resources elsewhere — and expect to lose out to competitors.</p>



<h3 class="wp-block-heading">Stop Wasting Effort on Data Quality</h3>



<p>The obvious approach to addressing these mistakes is to identify wasted resources and reallocate them to more productive uses of data. This is no small task. While there may be budget items and people assigned to support analytics, AI, architecture, monetization, and so on, there are no budgets and people assigned to waste time and money on bad data. Rather, this is hidden away in day-in, day-out work — the salesperson who corrects errors in data received from marketing, the data scientist who spends 80% of his or her time wrangling data, the finance team that spends three-quarters of its time reconciling reports, the decision maker who doesn’t believe the numbers and instructs his or her staff to validate them, and so forth. Indeed, almost all work is plagued by bad data.</p>



<p>The secret to wasting less time and money involves changing one’s approach from the current “buyer/user beware” mentality, where everyone is left on their own to deal with bad data, to creating data correctly — at the source. This works because finding and eliminating a single root cause can prevent thousands of future errors and eliminate the need to correct them downstream. This saves time and money — lots of it! The cost of poor data is on the order of 20% of revenue, and much of that expense can be eliminated permanently. That’s more than enough to fund the needed investments.</p>



<h3 class="wp-block-heading">Get On Offense</h3>



<p>Now consider the budgets for AI (as an example of “offense-minded” data efforts). It appears to us that, in many cases, the data science work to develop a new algorithm is funded well enough. Algorithm development is getting cheaper anyway, given that automated machine learning programs are doing more of the work. But useful algorithms die on the vine because the work to build processes, train people, address fear of change, and adapt the culture is substantially underfunded. Based on our experience, a good rule of thumb is that for every $1 you spend developing an algorithm, you should expect to spend $100 to deploy and support it. A few of these dollars will go to building algorithms into work processes, and many more to training, building a culture that embraces data, and change management. Most companies aren’t spending this money yet, which explains their lack of production AI deployments.</p>



<h3 class="wp-block-heading">Make Bold Moves</h3>



<p>What tangible steps should business leaders take to demonstrate that they are serious about data? First, they should more tightly couple their business and data strategies with an eye toward driving revenue growth. From the data perspective, opportunity abounds in fully exploiting proprietary data, driving analytics into every nook and cranny of the company, and augmenting virtually every decision using AI. You cannot — and should not — do them all, so you must select those most closely aligned with your business strategies. One sign that you’re on the right track is that there will be fewer data efforts. But those you do have will be far larger, more comprehensive, and more closely managed.</p>



<p>Second, get everyone fully engaged. After all, everyone is technically involved in your data efforts already. They interpret data correctly, or they do not; they create data correctly, or not; they use data to improve their work, or not; and they contribute to larger data initiatives, or not. Today, there are far too many “nots.” Similarly, managers push back against the nots, or they do not, and more senior leaders get in front of them, or not. So you must reach out to people, educate them, and enroll them in the effort, even as you grow increasingly intolerant of the inefficiencies stemming from bad data. This is going to take some time. One sign that you’re on the right track is that morale will improve. In our experience, once people get the hang of it, most of them find data work quite enjoyable. Importantly, in the data space, talent wins.</p>



<p>Third, draw a clear distinction between the management of data and the management of technology. Just as a movie is a different sort of asset than streaming technology, data and tech are different sorts of assets. Each demands its own specialized management. Yet today, too many companies subordinate data to tech. The result is that topics such as data architecture do not get the attention they deserve, leading to such absurdities as a bank having 130,000 databases, not including spreadsheets. Meanwhile, technology programs spend too much time dealing with the consequences of having systems that don’t talk to one another and too little time introducing new technologies to employees. One sign that you’re on the right track is that technology departments will become more effective and, in time, strategic.</p>



<p>Finally, now is a good time to start thinking about the longer-term roles data will play in your company. It is easy enough to recite the mantra “Data is the new oil.” And according to The Economist, data is now worth up to $2 trillion in the U.S. alone. But, of course, not all data is created equal. Some data — such as proprietary data, data needed to run the company, and data associated with other key assets — is so important that it should be treated as an asset in its own right. At the very least, you should make sure that end-to-end accountabilities for this data are clear.</p>



<p>We fully recognize how challenging these recommendations will prove to be. Yet they signal great opportunity, especially for the first companies in their sectors to embrace them. The needed approaches, methods, and technologies are widely available and have proved themselves over and over among digital natives and at the department level for established companies. It is clear enough that the future depends on data, so sooner or later, you have no real choice. As in all things, audentes Fortuna iuvat — fortune favors the brave.</p>
<p>The post <a href="https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/">Getting Serious About Data and Data Science</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/getting-serious-about-data-and-data-science/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Artificial Intelligence Tool Diagnoses Alzheimer’s with 95% Accuracy</title>
		<link>https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/</link>
					<comments>https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Tue, 01 Sep 2020 08:22:09 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Medical Research]]></category>
		<category><![CDATA[neural networks]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=11356</guid>

					<description><![CDATA[<p>Source: healthitanalytics.com A team from Stevens Institute of Technology has developed an artificial intelligence tool that can diagnose Alzheimer’s disease with more than 95 percent accuracy, eliminating the need <a class="read-more-link" href="https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/">Artificial Intelligence Tool Diagnoses Alzheimer’s with 95% Accuracy</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: healthitanalytics.com</p>



<p>A team from Stevens Institute of Technology has developed an artificial intelligence tool that can diagnose Alzheimer’s disease with more than 95 percent accuracy, eliminating the need for expensive scans or in-person testing.</p>



<p>In addition, the algorithm is also able to explain its conclusions, enabling human experts to check the accuracy of its diagnosis.</p>



<p>Alzheimer’s disease can impact a person’s use of language, the researchers noted. For example, people with Alzheimer’s tend to replace nouns with pronouns, and they can express themselves in a very roundabout, awkward way.</p>



<p>The team designed an explainable AI tool that uses attention mechanisms and a convolutional neural network to accurately identify well-known signs of Alzheimer’s, as well as subtle linguistic patterns that were previously overlooked.</p>



<p>Researchers trained the algorithm using texts composed by both healthy subjects and known Alzheimer’s sufferers describing a drawing of children stealing cookies from a jar. The team converted each individual sentence into a unique numerical sequence, or vector, representing a specific point in a 512-dimensional space.</p>



<p>This kind of approach allows even complex sentences to be assigned a concrete numerical value, making it easier to analyze structural and thematic relationships between sentences.</p>
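<p>The sentence-to-vector idea can be illustrated with a toy example. A real system would use a trained sentence encoder; the hashing scheme below is only a hypothetical stand-in that maps sentences into a 512-dimensional space so they can be compared geometrically:</p>

```python
# Toy illustration of sentence vectorization: each sentence becomes a
# fixed-length numerical vector, so structural/thematic similarity can
# be measured as a cosine similarity. This hashing trick is a stand-in
# for a trained encoder, not the method used by the researchers.
import hashlib
import math

DIM = 512  # matches the dimensionality mentioned in the article

def embed(sentence):
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]     # unit-length vector

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

v1 = embed("the children steal cookies from the jar")
v2 = embed("the children take cookies from the jar")
v3 = embed("stock prices fell sharply on friday")
print(cosine(v1, v2) > cosine(v1, v3))  # related sentences score higher
```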



<p>Using those vectors along with handcrafted features, the AI gradually learned to spot differences between sentences composed by healthy or unhealthy individuals, and was able to determine with significant accuracy how likely any given text was to have been produced by a person with Alzheimer’s.</p>



<p>“This is a real breakthrough,” said the tool’s creator, K.P. Subbalakshmi, founding director of Stevens Institute of Artificial Intelligence and&nbsp;professor of electrical and computer engineering at the Charles V. Schaefer School of Engineering &amp; Science.</p>



<p>“We’re opening an exciting new field of research, and making it far easier to explain to patients why the AI came to the conclusion that it did, while diagnosing patients. This addresses the important question of trustability of AI systems in the medical field.”&nbsp;&nbsp;</p>



<p>The AI system can also incorporate new criteria that may be identified by other research teams in the future, making the algorithm increasingly more accurate over time.</p>



<p>“We designed our system to be both modular and transparent,” Subbalakshmi explained. “If other researchers identify new markers of Alzheimer’s, we can simply plug those into our architecture to generate even better results.”</p>



<p>In the future, AI tools may be able to diagnose Alzheimer’s using any text, from emails to social media posts. However, to develop such an algorithm, researchers would need to train it on many different kinds of texts produced by known Alzheimer’s sufferers instead of just picture descriptions.</p>



<p>While this kind of data is not yet available, increasing access to this kind of information could lead to the development of accurate, comprehensive AI tools.</p>



<p>“The algorithm itself is incredibly powerful,” Subbalakshmi said. “We’re only constrained by the data available to us.”</p>



<p>The researchers’ next steps will be gathering new data that will help the algorithm diagnose patients with Alzheimer’s disease based on speech in languages other than English. The team is also uncovering ways in which other neurological conditions, such as aphasia, stroke, traumatic brain injuries, and depression, can impact language use.</p>



<p>“This method is definitely generalizable to other diseases,” said Subbalakshmi. “As we acquire more and better data, we’ll be able to create streamlined, accurate diagnostic tools for many other illnesses too.”&nbsp;</p>



<p>Researchers expect that providers can use this AI tool to more accurately diagnose Alzheimer’s, leading to earlier treatment and reduced healthcare costs.</p>



<p>“This is absolutely state-of-the-art,” said Subbalakshmi. “Our AI software is the most accurate diagnostic tool currently available while also being explainable.”</p>
<p>The post <a href="https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/">Artificial Intelligence Tool Diagnoses Alzheimer’s with 95% Accuracy</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/artificial-intelligence-tool-diagnoses-alzheimers-with-95-accuracy/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Six Key Steps to Ensure Data Quality for Artificial Intelligence</title>
		<link>https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/</link>
					<comments>https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Tue, 11 Feb 2020 05:47:11 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Heine Krog Iversen]]></category>
		<category><![CDATA[TimeXtender]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=6657</guid>

					<description><![CDATA[<p>Source: solutionsreview.com As a growing number of companies are looking to build out and leverage artificial intelligence solutions across their organization, they’re often delayed due to poor <a class="read-more-link" href="https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/">Six Key Steps to Ensure Data Quality for Artificial Intelligence</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: solutionsreview.com</p>



<p>As a growing number of companies look to build out and leverage artificial intelligence solutions across their organizations, they’re often delayed by the poor data quality that exists across their business operations. This quality deficiency prevents them from proceeding with their intended AI rollout. Once AI is fully implemented, however, it can improve data quality throughout a company.</p>



<p>Being faced with data quality issues forces a company to shift priorities and resources from implementing AI to fixing these quality shortcomings before it can proceed. This means extensive time delays, diverted resources, and a slow draining of the AI budget.</p>



<p>The magnitude of this problem is multiplied by the number of data sources a company possesses, and even more so when you consider the ever-growing volume of new data pouring in. By some measures, the amount of new data available to a company doubles every two years. Having an abundance of data is a strategic advantage and should be looked upon in this manner, and a data infrastructure should be in place to support all of it. One sure way to take advantage of all this data is with the application of AI.</p>



<p>To help address this challenge of getting data quality to a level suitable for AI, organizations should look to appropriately design or redesign their underlying data architecture. Following are a few key steps that can be taken.</p>



<ol class="wp-block-list"><li>Consolidate your data<br>
Consolidating and integrating all corporate data into a centralized data hub provides a single platform for all data. This helps ensure one version of the truth and a consistent data home for all users throughout an organization regardless of department.</li><li>Connect your data<br>
Having connectivity and a data exchange enables the retrieval of data in its raw form before cleansing. This allows for single connectivity to all data sources now and into the future.</li><li>Use a modern data warehouse<br>
By using a modern data warehouse (MDW), users can modify and enrich their data so that data issues are resolved once. Data that resides in various systems can be rationalized, and golden records created from the reconciled data. And with an MDW, historical data can be preserved.</li><li>Consider a semantic layer<br>
Within a data mart or what some call a semantic layer, data can be governed and prepared for any visualization tool. In addition, with a shared data model, all users can see the same data regardless of which data visualization tool is being used.</li><li>Deploy all-in-one data management software<br>
Too many organizations have complicated systems in place that obscure data management and lead to data quality issues. Having numerous tools requires extensive management and coordination to ensure synchronization among them. One recommended fix is an all-encompassing data management platform that eliminates the need for multiple discrete tools.</li><li>Automate data management procedures<br>
With an automated data management platform, time-consuming, hand-written code is eliminated, which improves quality. It also frees up time to work on other data quality issues and, ultimately, the bigger AI program.</li></ol>



<p>AI, once fully in place after data quality is secured, can also improve the quality of future data. With an established AI program, the inherent intelligence in AI can be used to automate the gathering and collecting of needed data and automatically enter the data – removing the need for manual data entry. And when you remove manual tasks you generally improve data quality. AI can also identify data errors and anomalies, remove duplicate or outdated records, and identify third-party data sources that can provide value related to the data model.</p>
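<p>As a hypothetical sketch of that kind of automated clean-up, the snippet below keeps only the most recently updated record per customer, removing duplicates and outdated rows. The column names and data are invented purely for illustration:</p>

```python
# Sketch of automated deduplication: keep one "golden" record per
# customer (the most recently updated one). Data is invented.
import pandas as pd

records = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@y.com", "c@x.com"],
    "updated_at": pd.to_datetime(
        ["2020-01-01", "2020-06-01", "2020-03-01", "2020-05-01", "2020-02-01"]
    ),
})

# Sort oldest-to-newest, then keep the last (newest) row per customer.
golden = (
    records.sort_values("updated_at")
           .drop_duplicates("customer_id", keep="last")
           .reset_index(drop=True)
)
print(len(golden))  # one record per customer
```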



<p>With all this in mind, bringing AI into an enterprise is certainly a worthy cause, but it starts with having solid data quality across the enterprise. And with AI in place, not only can organizations enjoy all the benefits of AI, such as predicting trends, identifying new opportunities, and answering tough business questions, but they can also be assured that the quality of their future data will improve as well.</p>
<p>The post <a href="https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/">Six Key Steps to Ensure Data Quality for Artificial Intelligence</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/six-key-steps-to-ensure-data-quality-for-artificial-intelligence/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Machine Learning in Drug Development Requires Data Access, Standards</title>
		<link>https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/</link>
					<comments>https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 23 Jan 2020 07:21:31 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Data Standards]]></category>
		<category><![CDATA[Drug Development]]></category>
		<category><![CDATA[Drug Discovery]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=6320</guid>

					<description><![CDATA[<p>Source: healthitanalytics.com January 22, 2020 &#8211; Machine learning algorithms have the potential to accelerate and refine the drug development process, but the industry should expand data access <a class="read-more-link" href="https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/">Machine Learning in Drug Development Requires Data Access, Standards</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: healthitanalytics.com</p>



<p>January 22, 2020 &#8211; Machine learning algorithms have the potential to accelerate and refine the drug development process, but the industry should expand data access and create consistent data standards to ensure drug companies can fully leverage these tools, according to a report from the Government Accountability Office (GAO).</p>



<p>Drug companies spend 10 to 15 years bringing a drug to market, often at high cost. Only about one in every 10,000 chemical compounds initially tested for drug potential makes it through the research and development pipeline, GAO noted, and is then approved by the FDA for marketing in the US. Machine learning tools could accelerate and improve the drug development process.</p>



<p>“Machine learning can make drug development more efficient and effective, decreasing the time and cost required to bring potentially more effective drugs to market,” GAO said.</p>



<p>“Both of these improvements could save lives and reduce suffering by getting drugs to patients in need more quickly. Lower research and development costs could also allow researchers to invest more resources in disease areas that are currently not considered profitable to pursue, such as rare or orphan diseases.”</p>



<p>Although drug companies already use machine learning throughout the drug development process, there are several challenges that hinder its advancement in this area, including barriers to data access and sharing.</p>



<p>“According to one industry representative, collecting data from the early drug discovery phase can be cost prohibitive. This representative said that certain health-related data may cost tens of thousands of dollars, as compared to just cents for other consumer related data that many technology companies use,” GAO stated.</p>



<p>“Data sharing also presents unique legal issues. According to stakeholders, privacy laws such as HIPAA can make it difficult for drug companies, especially those that are not regulated by HIPAA, to share or access data.”</p>



<p>To increase data sharing and access, GAO recommended that policymakers create mechanisms or incentives to share data held by private or public sectors, while also ensuring patient information is protected.</p>



<p>“To promote greater availability of data, policymakers could consider forming or facilitating research consortia that allow for secure data sharing,” the organization wrote.</p>



<p>“Policymakers could also consider creating a data repository through encouraging an industry-driven solution, establishing a public-private partnership, or creating a repository of all data under their control.”</p>



<p>In creating new ways to share and access data, stakeholders should ensure they adhere to laws around information exchange.</p>



<p>“Improper data sharing or use could have legal consequences. Increased data sharing could therefore require a careful review of the legal ramifications, because data are often gathered through a wide variety of mechanisms and governed by multiple legal frameworks,” GAO advised.</p>



<p>In addition to data sharing and access, policymakers will need to address the lack of quality data in the drug development process.</p>



<p>“Machine learning requires a large amount of accurate and representative data. This poses a unique challenge in drug development, as much of the data were not originally collected with machine learning in mind and may not be machine-readable or model-ready,” GAO wrote.</p>



<p>“Furthermore, according to an industry representative, data collected across different organizations and environments come in different formats, and this lack of standardization in data quality is a barrier.”</p>






<p>Overcoming data quality issues will require policymakers to collaborate with appropriate stakeholders to establish data standards, GAO said.</p>



<p>“For example, a standard that defines synthetic data and how they can be used can help reduce bias by allowing researchers to generate data that could be used to better represent currently underrepresented communities,” the agency stated.</p>



<p>“Similarly, a standard data format for uploading and sharing data across platforms could reduce the need for data scientists to spend time converting data sets to machine-readable formats.”</p>



<p>GAO also named drug development research gaps as an obstacle to machine learning use.</p>



<p>“Research gaps present a significant challenge to advancing the use of machine learning in drug development. These gaps fall into two broad categories: gaps in understanding of fundamental biology and chemistry, and gaps in domain-specific machine learning research,” GAO said.</p>



<p>“Experts in the field have noted that addressing these issues may be transformational for future applications of machine learning in drug development.”</p>



<p>GAO suggested that policymakers promote basic research to generate new and better data to improve understanding of machine learning in drug development.</p>



<p>“Policymakers could promote the field in multiple ways, including approaches such as support for intramural research, grants, or other subsidies. Policymakers could choose to use one of these approaches or combine them,” GAO said.</p>



<p>“Policymakers could also support collaboration across sectors. The Machine Learning for Pharmaceutical Discovery and Synthesis Consortium (MLPDS) is a collaboration between large drug companies such as Pfizer, Merck, and Novartis with the Chemical Engineering, Chemistry, and Computer Science departments at the Massachusetts Institute of Technology, and has published a variety of papers at the intersection of machine learning and drug development.”</p>



<p>With these recommendations, policymakers and other stakeholders can advance the use of machine learning in drug development, refining and speeding the process to benefit patients.</p>
<p>The post <a href="https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/">Machine Learning in Drug Development Requires Data Access, Standards</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/machine-learning-in-drug-development-requires-data-access-standards/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Artificial Intelligence Success Requires Human Validation, Good Data</title>
		<link>https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/</link>
					<comments>https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Wed, 27 Nov 2019 07:22:55 +0000</pubDate>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[Analytics Strategies]]></category>
		<category><![CDATA[analytics technologies]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Data Standards]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=5423</guid>

					<description><![CDATA[<p>Source: healthitanalytics.com November 26, 2019&#160;&#8211;&#160;As the foundation of nearly every healthcare trend, process, and solution, data has a vitally important role to play in care delivery and <a class="read-more-link" href="https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/">Artificial Intelligence Success Requires Human Validation, Good Data</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: healthitanalytics.com</p>



<p>November 26, 2019&nbsp;&#8211;&nbsp;As the foundation of nearly every healthcare trend, process, and solution, data has a vitally important role to play in care delivery and success.</p>



<p>From risk stratification to chronic disease management, precision medicine and medical research, data is at the center of everyday healthcare tasks and broader industry improvements, making it an incredibly valuable resource for organizations.</p>



<p>“If you talk to any data scientist, they’ll tell you that the more quality, scientifically validated data they have, the more likely they&#8217;re going to be able to generate useful trends and insights,” said Todd Frech, CIO at Press Ganey.</p>



<p>“The core of everything we do is taking the vast amounts of data that we collect and creating value for hospitals that are trying to improve their operations.”</p>



<p> With healthcare quickly becoming a digital industry, more and more entities are gathering meaning from this big data using artificial intelligence and other advanced analytics technologies. </p>



<p>“The challenge that we&#8217;re trying to overcome is that we have more data than a human can process, and we&#8217;re trying to develop insights based on those volumes of data. This issue is a natural fit for AI, so the use of this technology is going to continue to accelerate,” Frech said.</p>



<p>“AI can augment humans’ understanding of data, not only from the perspective of generating new insights, but also in generating those insights faster than a typical human analyst processes.”</p>



<p>There are countless examples of AI outperforming humans in analyzing and extracting insights from clinical data. The technology’s potential to transform the industry has led to concerns about robots encroaching on healthcare jobs, creating an environment run entirely by machines and devoid of human interaction.</p>



<p>While AI may disrupt standard care delivery, it’s unlikely that advanced analytics tools will completely take over the role of clinicians. In a field where high-stakes situations and sensitive data are routine, technology can’t simply be left to operate on its own, Frech stressed.</p>



<p>“AI is going to play a bigger part in healthcare, and humans will also continue to play a big part,” he said.</p>



<p>“We can&#8217;t just assume that AI is making the right decisions without human validation. There&#8217;s a trend that you&#8217;re going to see more – what’s called AI augmentation, or human augmentation with AI, more than what you would call complete robotic AI, meaning that you&#8217;re letting the AI make decisions without human intervention.”</p>



<p>Recent research has demonstrated that when implementing AI tools, human intervention can lead to optimal results. A study conducted by a team at NYU School of Medicine and the NYU Center for Data Science showed that combining AI with analysis from human radiologists significantly improved breast cancer detection.</p>



<p>Using an AI augmentation approach could also help organizations analyze and measure unstructured data.</p>



<p>“We collect hundreds of thousands of survey data points in the forms of responses to questions, as well as unstructured data in the form of comments. We use AI to look at the comments that come in our surveys. Those comments are obviously in the form of unstructured text, and they convey information on perception of the providers and of the service,” explained Frech.</p>



<p>“Those aren’t yes or no questions. Those are questions that require some soft skills to interpret. We can use AI to do an initial sentiment analysis, and that provides a way for us to really measure this type of data, which is not as binary as some of the data we typically evaluate.”</p>
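As a toy illustration of that kind of initial sentiment analysis, the sketch below scores free-text comments against small positive and negative word lists. The word lists and scoring rule are invented for the example and are not Press Ganey's actual AI pipeline:

```python
# Illustrative word lists (assumptions for the example, not a real lexicon).
POSITIVE = {"caring", "helpful", "excellent", "friendly", "clean"}
NEGATIVE = {"rude", "slow", "dirty", "painful", "confusing"}

def sentiment_score(comment: str) -> int:
    """+1 per positive word, -1 per negative word found in the comment."""
    words = comment.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def label(comment: str) -> str:
    """Map the raw score to a coarse sentiment label."""
    score = sentiment_score(comment)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

A real system would use a trained model rather than a fixed lexicon, but even this sketch shows how unstructured comments become measurable data points that can be aggregated alongside structured survey responses.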



<p>However, data-driven technologies can’t improve care if they’re fed inaccurate or incomplete information – in fact, this could have the opposite effect.</p>



<p>“Never underestimate the importance of data quality,” said Frech. &nbsp;“No AI tool is going to work well without high-quality data. People talk about data lakes and unstructured data, and these things are great tools. But without quality data, you’re going to have more of a data swamp than a data lake.”</p>



<p>“If you&#8217;re trying to use AI to gather insights without high-quality data, obviously the results aren’t going to pan out. Or even worse, the results could potentially offer dangerous recommendations that could negatively impact people,” he added.</p>



<p>Having a solid data ecosystem, along with communicating openly with other organizations, ensures that any innovative tools will contribute positively to a health system’s operations, Frech said.</p>



<p>“When implementing artificial intelligence or any new technology, make sure that the foundation is strong. Make sure that there&#8217;s testing and validation. If that doesn&#8217;t happen, there is potential for organizations to take steps backward rather than forward,” he said.</p>



<p>“Find opportunities with your peers, find case studies, talk to people who are using the technology. The more that your organizations can collaborate and learn from each other, the more ideas and successes will increase.”</p>



<p>AI has massive potential to revolutionize the way providers deliver care and make treatment decisions. The road to industry-wide adoption won’t be without its challenges, but the technology will likely make its way into regular clinical care.</p>



<p>“There are a lot of different ways to use AI, and there has been a lot of experimentation. Over time, there will be more and more successes, and those successes will come in fits and starts, depending on how the data in the market mature. There&#8217;s too much investment in AI right now to not have some of those successes come into play,” Frech concluded.</p>
<p>The post <a href="https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/">Artificial Intelligence Success Requires Human Validation, Good Data</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/artificial-intelligence-success-requires-human-validation-good-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>NIH Promotes Big Data to Enhance Eye Disease Research</title>
		<link>https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/</link>
					<comments>https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 01 Aug 2019 06:03:24 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[Clinical Analytics]]></category>
		<category><![CDATA[Data Integrity]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[Genomics]]></category>
		<category><![CDATA[NIH]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=4187</guid>

					<description><![CDATA[<p>Source: healthitanalytics.com July 31, 2019 &#8211; Improving collaboration between specialists and integrating multiple datasets to leverage big data will be key for advancing research for dry age-related macular degeneration <a class="read-more-link" href="https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/">NIH Promotes Big Data to Enhance Eye Disease Research</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: healthitanalytics.com</p>



<p>July 31, 2019 &#8211; Improving collaboration between specialists and integrating multiple datasets to leverage big data will be key for advancing research for dry age-related macular degeneration (AMD), according to a new report from a National Institute of Health (NIH) working group.</p>



<p>Over 11 million people in the United States have been diagnosed with AMD, an eye disease that can ultimately result in blindness. It is the leading cause of blindness among individuals 65 years of age and older.</p>



<p>The disease can manifest in one of two forms: neovascular (wet) or non-neovascular (dry). While the neovascular form progresses more rapidly, it has several known and proven treatments. Dry AMD currently has neither preventive measures nor treatment options.</p>






<p>“The working group thoroughly assessed what is known about dry AMD pathobiology, and the recommendations will be informative for considering future NEI research priorities to align with promising pathways for discovering therapeutic targets,” said National Eye Institute (NEI) Director Paul Sieving, MD, PhD, in an earlier news release.</p>



<p>The working group recommended a systems biology approach to disease treatment: an integration of genomic, preclinical, medical, pharmacological, and clinical data to inform modeling of the disease progression. Synthesizing big data from all these areas, including tissue samples from clinical trials, will support predictive modeling, which can then be used to inform individual patient care.</p>



<p>A personalized approach to disease management may also be helpful, the working group recommended. Such an approach should consider the disease stage, progression, and individual risk factors to provide preventive and treatment strategies specific to the patient, the report said. Collaborating with all points of care will allow a multidisciplinary team to use a patient’s unique clinical, imaging, and genomic data to treat the disease.</p>



<p>“We propose that researchers utilize a systems biology approach, integrating the big data available from clinical registries and various fields of biology known as ‘omics’ to develop better models and ultimately treatments for patients with this blinding disease,” stated report co-author Joan W. Miller, MD.</p>



<p>Due to a lack of preventive strategies and treatment options for dry AMD, the working group noted the need for improved understanding of the disease pathology and promoted clinical trial investigations to do so. Previous research has shown a genetic link to the disease as well as several lifestyle factors including smoking, but there is no work examining the effects these factors have on dry AMD.</p>



<p>A better understanding of how these factors impact the disease will help providers watch for risk factors and promote inventive preventive strategies. Such understanding comes only from examining data, which is why the group promoted the use of big, integrated data sources that help investigators draw on multiple sources to answer their questions.</p>



<p>Effective disease management will need multiple targets that differ based on the disease stage progression, the report notes. A strategy overhaul needs to take place that focuses on large-scale, collaborative, systems biology in order to effectively treat the disease.</p>



<p>“This approach would integrate basic, genomic, pre-clinical, medical, pharmacological, and clinical data into mathematical models of pathological processes at different stages of dry AMD in order to ask how relevant individual components act together within the living system,” Miller said.</p>



<p>The working group was appointed by the National Advisory Eye Council, a 12-member panel that establishes guidelines for the NEI under the NIH. The group was charged with a multilayered goal: to raise public health awareness about the impact of dry AMD, review the current state of research about the disease for a better understanding of its pathology, propose future research directions, encourage scientists to focus on AMD, and promote collaboration among a network of specialized providers.</p>
<p>The post <a href="https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/">NIH Promotes Big Data to Enhance Eye Disease Research</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/nih-promotes-big-data-to-enhance-eye-disease-research/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>How dataops improves data, analytics, and machine learning</title>
		<link>https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/</link>
					<comments>https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 21 Jun 2019 10:51:07 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[analyze]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[DataOps]]></category>
		<category><![CDATA[improves]]></category>
		<category><![CDATA[master]]></category>
		<category><![CDATA[Technology]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3893</guid>

					<description><![CDATA[<p>Source:- infoworld.com A dataops team will help you get the most out of your data. Here’s how people, processes, technology, and culture bring it all together Have you noticed <a class="read-more-link" href="https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/">How dataops improves data, analytics, and machine learning</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:- infoworld.com</p>
<h3>A dataops team will help you get the most out of your data. Here’s how people, processes, technology, and culture bring it all together. Have you noticed that most organizations are trying to do a lot more with their data?</h3>
<p>Businesses are investing heavily in data science programs, self-service business intelligence tools, artificial intelligence programs, and organizational efforts to promote data-driven decision making. Some are developing customer-facing applications by embedding data visualizations into web and mobile products or collecting new forms of data from sensors (Internet of Things), wearables, and third-party APIs. Still others are harnessing intelligence from unstructured data sources such as documents, images, videos, and spoken language.</p>
<p>Much of the work around data and analytics is on delivering value from it. This includes dashboards, reports, and other data visualizations used in decision making; models that data scientists create to predict outcomes; or applications that incorporate data, analytics, and models.</p>
<p>What has sometimes been undervalued is all the underlying data operations work, or dataops, that it takes before the data is ready for people to analyze and format into applications to present to end users.</p>
<p>Dataops includes all the work to source, process, cleanse, store, and manage data. We’ve used complicated jargon to represent different capabilities such as data integration, data wrangling, ETL (extract, transform and load), data prep, data quality, master data management, data masking, and test data management.</p>
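A toy end-to-end sketch of the ETL (extract, transform and load) step named above, with in-memory stand-ins for the real source and destination systems, and a simple data prep/dedupe pass folded into the transform:

```python
raw_source = [" Alice ,34", "BOB,29", "alice ,34"]  # messy CSV-ish lines

def extract(source):
    """Pull raw rows out of the source system."""
    return [line.split(",") for line in source]

def transform(rows):
    """Cleanse and normalize values, then drop duplicate records."""
    cleaned = []
    for name, age in rows:
        cleaned.append({"name": name.strip().title(), "age": int(age)})
    unique = {(r["name"], r["age"]): r for r in cleaned}
    return list(unique.values())

def load(rows, destination):
    """Write the prepared rows into the destination store."""
    destination.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(raw_source)), warehouse)
```

Real dataops pipelines run such steps on tools like Spark against databases and object stores, but the extract, transform, and load stages keep this same shape.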
<p>The post <a href="https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/">How dataops improves data, analytics, and machine learning</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/how-dataops-improves-data-analytics-and-machine-learning/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Big data systems up ante on data quality measures for users</title>
		<link>https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/</link>
					<comments>https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/#comments</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 06 Oct 2017 07:35:14 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Amazon Web Services]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[data scientists]]></category>
		<category><![CDATA[IT]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=1374</guid>

					<description><![CDATA[<p>Source &#8211; searchdatamanagement.techtarget.com NEW YORK &#8212; In the rush to capitalize on deployments of big data platforms, organizations shouldn&#8217;t neglect data quality measures that can ensure what&#8217;s <a class="read-more-link" href="https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/">Big data systems up ante on data quality measures for users</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source &#8211; <strong>searchdatamanagement.techtarget.com</strong></p>
<p>NEW YORK &#8212; In the rush to capitalize on deployments of big data platforms, organizations shouldn&#8217;t neglect data quality measures that can ensure what&#8217;s used in analytics applications is clean and trustworthy, experienced IT managers said at the 2017 Strata Data Conference here last week.</p>
<p>Several speakers pointed to data quality as a big challenge in their big data environments &#8212; one that required new processes and tools to help get a handle on quality issues, as both the volumes of data being fed into corporate data lakes and use of the info by data scientists and other analysts grow.</p>
<p>&#8220;The more of the data you produce is used, the more important it becomes, and the more important data quality becomes,&#8221; said Michelle Ufford, manager of core innovation for data engineering and analytics at Netflix Inc. &#8220;But it&#8217;s very, very difficult to do it well &#8212; and when you do it well, it takes a lot of time.&#8221;</p>
<p>Over the past 12 months, Ufford&#8217;s team worked to streamline the Los Gatos, Calif., company&#8217;s data quality measures as part of a broader effort to boost data engineering efficiency based on a &#8220;simplify and automate&#8221; mantra, she said during a Strata session.</p>
<p>A starting point for the data-quality-upgrade effort was &#8220;acknowledging that not all data sets are created equal,&#8221; she noted. In general, ones with high levels of usage get more data quality checks than lightly used ones do, according to Ufford, but trying to stay on top of that &#8220;puts a lot of cognitive overhead on data engineers.&#8221; In addition, it&#8217;s hard to spot problems just by looking at the metadata and data-profiling statistics that Netflix captures in an internal data catalog, she said.</p>
<section class="section main-article-chapter" data-menu-title="Calling for help on data quality">
<h3 class="section-title"><i class="icon" data-icon="1"></i>Calling for help on data quality</h3>
<p>To ease those burdens, Netflix developed a custom data quality tool, called Quinto, and a Python library, called Jumpstarter, which are used together to generate recommendations on quality coverage and to set automated rules for assessing data sets. When data engineers run Spark-based extract, transform and load (ETL) jobs to pull in data on use of the company&#8217;s streaming media service for analysis, transient object tables are created in separate partitions from the production tables, Ufford said. Calls are then made from the temporary tables to Quinto to do quality checks before the ETL process is completed.</p>
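The pattern described, running automated rules against a transient data set and completing the load only if they pass, might look like the following sketch. The rule names, thresholds, and data shapes are assumptions for illustration, not Quinto's actual API:

```python
def rule_non_empty(rows):
    """The staged data set must contain at least one row."""
    return len(rows) > 0

def rule_no_null_user_id(rows):
    """Every row must carry a user_id."""
    return all(r.get("user_id") is not None for r in rows)

def rule_row_count_within(expected, tolerance=0.1):
    """Row count must be within a tolerance of the expected volume."""
    def check(rows):
        return abs(len(rows) - expected) <= expected * tolerance
    return check

def run_quality_checks(staging_rows, rules):
    """Return (passed, failures); promote to production only if passed."""
    failures = [r.__name__ for r in rules if not r(staging_rows)]
    return (not failures, failures)

staging = [{"user_id": 1, "minutes": 42}, {"user_id": 2, "minutes": 7}]
rules = [rule_non_empty, rule_no_null_user_id, rule_row_count_within(2)]
ok, failed = run_quality_checks(staging, rules)
```

The point of automating this, as Ufford describes, is that coverage recommendations and rule execution happen in the pipeline itself rather than as cognitive overhead on each data engineer.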
<p>In the future, Netflix plans to expand the statistics it tracks when profiling data and implement more robust anomaly detection capabilities that can better pinpoint &#8220;what is problematic or wrong&#8221; in data sets, Ufford added. The ultimate goal, she said, is making sure data engineering isn&#8217;t a bottleneck for the analytics work done by Netflix&#8217;s BI and data science teams and its business units.</p>
<p>Improving data consistency was one of the goals of a cloud-based data lake deployment at Financial Industry Regulatory Authority Inc., an organization in Washington, D.C., that creates and enforces rules for financial markets. Before the big data platform was set up, fragmented data sets in siloed systems made it hard for data scientists and analysts to do their jobs effectively, said John Hitchingham, director of performance engineering at the not-for-profit regulator, more commonly known as FINRA.</p>
<p>A homegrown data catalog, called herd, was &#8220;a real key piece for making this all work,&#8221; Hitchingham said in a presentation at the conference. FINRA collects metadata and data lineage info in the catalog; it also lists processing jobs and related data sets there, and it uses the catalog to track schemas and different versions of data in the big data architecture, which runs in the Amazon Web Services (AWS) cloud.</p>
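<p>Herd's actual API isn't shown in the article, but the kinds of things it tracks per data set &#8212; schema, version, S3 location, and lineage back to upstream data sets &#8212; can be sketched with a simple registry. The structure and field names below are assumptions for illustration only.</p>

```python
# Toy data catalog: each data set name maps to a list of registered versions,
# and each version records its schema, storage location, and lineage.

catalog = {}

def register_dataset(catalog, name, version, schema, s3_path, upstream=()):
    """Record a new version of a data set along with its lineage."""
    entry = {
        "version": version,
        "schema": schema,            # column name -> type
        "location": s3_path,         # where the registered data lives in S3
        "lineage": list(upstream),   # data sets this one was derived from
    }
    catalog.setdefault(name, []).append(entry)
    return entry

register_dataset(catalog, "trades_raw", 1,
                 {"symbol": "string", "price": "double"},
                 "s3://example-bucket/trades_raw/v1/")
register_dataset(catalog, "trades_clean", 1,
                 {"symbol": "string", "price": "double"},
                 "s3://example-bucket/trades_clean/v1/",
                 upstream=["trades_raw"])
```

<p>Tracking lineage this way is what lets analysts trace a cleaned data set back to its raw source &#8212; the "key piece" that makes a catalog more than a list of tables.</p>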
<p>To help ensure the data is clean and consistent, Hitchingham&#8217;s team runs validation routines after it&#8217;s ingested into Amazon Simple Storage Service (S3) and registered in the catalog. The validated data is then written back to S3, completing a process that he said also reduces the amount of ETL processing required to normalize and enrich data sets before they&#8217;re made available for analysis.</p>
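<p>The validate-after-ingest step can be sketched as a routine that splits ingested rows into valid and rejected sets, with only the valid ones written back for analysis. The specific rules below (required fields present, positive prices) are illustrative assumptions, not FINRA's actual checks.</p>

```python
# Post-ingestion validation: partition rows into valid and rejected sets so
# only clean data is written back to storage for analysts to use.

def validate(rows, required=("symbol", "price")):
    """Split ingested rows into (valid, rejected) lists."""
    valid, rejected = [], []
    for row in rows:
        if all(row.get(f) is not None for f in required) and row["price"] > 0:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected

ingested = [
    {"symbol": "ABC", "price": 10.5},
    {"symbol": None, "price": 3.0},    # missing symbol -> rejected
    {"symbol": "XYZ", "price": -1.0},  # impossible price -> rejected
]
valid, rejected = validate(ingested)
```

<p>Because the surviving rows are already known to be clean, downstream ETL jobs can skip the defensive cleansing they would otherwise repeat &#8212; the reduction in ETL work Hitchingham describes.</p>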
</section>
<section class="section main-article-chapter" data-menu-title="Data quality takes a business turn">
<h3 class="section-title"><i class="icon" data-icon="1"></i>Data quality takes a business turn</h3>
<p>The analytics team at Ivy Tech Community College in Indianapolis also does validation checks as data is ingested into its AWS-based big data system &#8212; but only to make sure the data matches what&#8217;s in the source systems from which it&#8217;s coming. The bulk of the school&#8217;s data quality measures are now carried out by individual departments in their own systems, said Brendan Aldrich, Ivy Tech&#8217;s chief data officer.</p>
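<p>A source-matching check of the kind Ivy Tech describes can be sketched as a reconciliation of simple aggregates &#8212; confirming the warehouse copy agrees with the source system on row counts and column totals, rather than cleansing values centrally. The aggregates chosen here are illustrative assumptions.</p>

```python
# Reconciliation check: verify the ingested copy matches the source system
# on cheap aggregates instead of re-validating every field centrally.

def reconcile(source_rows, ingested_rows, amount_field="credits"):
    """Return True when row counts and column totals match across systems."""
    same_count = len(source_rows) == len(ingested_rows)
    same_total = (sum(r[amount_field] for r in source_rows)
                  == sum(r[amount_field] for r in ingested_rows))
    return same_count and same_total

source = [{"student": "s1", "credits": 12}, {"student": "s2", "credits": 9}]
loaded = [{"student": "s1", "credits": 12}, {"student": "s2", "credits": 9}]
```

<p>A mismatch here signals an ingestion problem to fix in the pipeline; bad values that match the source are, by design, routed back to the owning department to correct in the front-end system.</p>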
<p>&#8220;Data cleansing is a never-ending process,&#8221; Aldrich said in an interview before speaking at the conference. &#8220;Our goal was, rather than getting on that treadmill, why not engage users and get them involved in cleansing the data where it should be done, in the front-end systems?&#8221;</p>
<p>That process started taking shape when Ivy Tech, which operates 45 campuses and satellite locations across Indiana, deployed the cloud platform and Hitachi Vantara&#8217;s Pentaho data integration and BI software in late 2013 to give its business users self-service analytics capabilities. And it was cemented in July 2016 when the college hired a new president who mandated that business decisions be based on data, Aldrich said.</p>
<p>The central role data plays in decision-making gives departments a big incentive to ensure information is accurate before it goes into the analytics system, he added. As a result, data quality problems are being found and fixed more quickly now, according to Aldrich. &#8220;Even if you&#8217;re cleansing data centrally, you usually don&#8217;t find [an issue] until someone notices it and points it out,&#8221; he said. &#8220;In this case, we&#8217;re cleansing it faster than we were before.&#8221;</p>
</section>
</section>
<p>The post <a href="https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/">Big data systems up ante on data quality measures for users</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/big-data-systems-up-ante-on-data-quality-measures-for-users/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
	</channel>
</rss>
