Federated Analytics and the Rebirth of Data Science
Source – cio.com
Until recently, data scientists could design algorithms with the assumption that the data to be explored would be brought together in a single, centralized repository, such as a data lake or a cloud data center. But with the explosion of data and the rise of the Internet of Things (IoT), social media, mobility and other new sources of data, the paradigm is shifting. We can no longer assume that data can all be brought into a single repository for analysis.
In today’s world, data is inherently distributed. With IoT, for example, data is generated and often stored close to sensors or observation points. In many cases, moving this data into a single location before it can be analyzed can be a challenging proposition. And in some cases, data simply cannot be transmitted to a central location because of bandwidth constraints. In other cases, data movement is limited by governance, risk and compliance issues, along with restrictions imposed by security and privacy controls.
So where does that leave us? If we can’t bring data together for analysis, we have to take the analysis to the data. Analytics now needs to happen in many controlled, well-defined and well-secured places: at the edge, in the fog or core, and in the cloud or enterprise data centers. The intermediate results produced at these locations may then need to be fused and analyzed together.
Many people are talking about this fundamental change. They get the big picture. However, relatively few people are talking about (1) how data science algorithms will be re-designed to reason and learn in a federated manner; (2) how analytics will be distributed close to where the data is collected; and (3) how the intermediate results will be aggregated and analyzed together to drive higher-order learning at scale. Those are much more challenging problems to solve. And this is where federated analytics enters the picture.
How federated analytics works
Federated analytics allows data scientists to generate analytical insight from the combined information in distributed datasets without requiring all the data to move to a central location, and while minimizing the amount of data moved when intermediate results are shared. When required, federated analytics must also respect compliance requirements and preserve privacy. In other words, it must not be possible to reverse-engineer the shared intermediate results into the individual values used to calculate them.
For example, if a location shares the sum of 1,000 real values, the individual values cannot be recovered from the sum alone, because infinitely many combinations of values produce the same total. In essence, this sum is privacy preserving in and of itself. This is not the case, however, if the 1,000 values are known to be positive integers and their sum is exactly 1,000. Every value must then be 1, so the sum reveals all of them and should not be shared if privacy must be preserved.
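The boundary case above can be checked mechanically. The following Python sketch uses a hypothetical helper, `sum_reveals_values`, which is not part of any library. It flags only the situations where a sum of positive integers pins down every value, and it should not be read as a general privacy test.

```python
def sum_reveals_values(total: int, count: int) -> bool:
    """Return True when a sum of `count` positive integers fixes every value.

    Each positive integer is at least 1, so the smallest possible sum is
    `count`. If the sum equals that minimum, every value must be exactly 1.
    A sum over a single value trivially reveals that value.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if total < count:
        raise ValueError("a sum of positive integers cannot be below the count")
    return count == 1 or total == count


# 1,000 positive integers that sum to 1,000: every value must be 1.
print(sum_reveals_values(1000, 1000))  # True
# 1,000 positive integers that sum to 1,500: many combinations are possible.
print(sum_reveals_values(1500, 1000))  # False
```

A site could run a check of this kind before releasing an aggregate and withhold the result when it returns True.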
Under a federated analytics model, most data is analyzed close to where it is generated. Through analytics, learning happens at the edge, in the fog/core, and in the cloud or enterprise data center, and collective, collaborative learning happens at a global level.
To enable collaboration at scale, federated analytics allows the intermediate results of data analytics to be shared while the raw data remains in its locked-down location. When the shared results are combined and analyzed, a higher order of learning happens, and the owners of the individual datasets have the opportunity to compare their local results against the results of analyzing the combined pool of data.
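As a minimal sketch of this idea, consider computing a global mean. The Python code below assumes each site shares only a (sum, count) pair as its intermediate result. The site names, the readings and the functions `local_summary` and `global_mean` are all hypothetical, and whether such a pair is safe to share depends on the data, as the sum example earlier shows.

```python
from typing import Dict, List, Tuple


def local_summary(values: List[float]) -> Tuple[float, int]:
    """Runs at a site: reduce raw values to a (sum, count) pair."""
    return sum(values), len(values)


def global_mean(summaries: List[Tuple[float, int]]) -> float:
    """Runs at the aggregator: combine (sum, count) pairs into one mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    if count == 0:
        raise ValueError("no observations were reported")
    return total / count


# Hypothetical raw readings that never leave their site.
site_data: Dict[str, List[float]] = {
    "site_a": [12.0, 18.0, 30.0],
    "site_b": [8.0, 22.0],
    "site_c": [40.0, 35.0, 25.0, 10.0],
}

# Only these summaries are shared with the aggregator.
summaries = {name: local_summary(vals) for name, vals in site_data.items()}
combined = global_mean(list(summaries.values()))

# Each site can compare its local mean against the global one.
for name, (total, count) in summaries.items():
    print(f"{name}: local mean {total / count:.2f}, global mean {combined:.2f}")
```

The aggregator sees three pairs of numbers instead of nine raw readings, and each site still learns how its local result compares with the combined pool.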
An example use case
To make this story more tangible, let’s consider an example use case for collaboration under a federated analytics model.
With a focus on value-based outcomes, pharmaceutical companies are developing analytic tools to measure the effectiveness of certain treatments, in near real time, against cohorts of individuals around the world. The goal is to help identify common characteristics in patients who demonstrate a better response, as well as in patients who have a weaker response.
Thanks to federated analytics, this global benchmarking can analyze data at the edge, close to where the data is collected and within geographical boundaries defined by regulatory compliance. Only the analytics logic itself and aggregated intermediate results traverse borders to facilitate data analysis across multi-cloud environments. This approach fully respects the privacy, governance, risk and compliance constraints for the data held by individual healthcare providers.
Consider, for example, a simple histogram that provides metrics on the effectiveness of a specific treatment based on the relative decrease in the cholesterol level. The analysis is done over a cohort of patients within a certain age group, with similar diagnostic profiles, who started the treatment on the same day. Every week, the histogram indicates how many patients in the cohort decreased their cholesterol level by 10 points or fewer, between 11 and 20 points, between 21 and 30 points, between 31 and 40 points, and by more than 40 points.
Using federated analytics, each of the thousands of participating clinical trial sites can simply analyze its data locally and share a privacy-preserving histogram of its local results.
This profile consists of a set of five key-value pairs, one per histogram bin, each pairing a bin with the number of patients whose decrease fell within it.
At any location, the intermediate results can then be combined and a global histogram generated.
It is important to note that these intermediate results are privacy-preserving and the individual data for each patient, such as age, individual cholesterol value, and individual historical profile, are not shared and remain within their location. In addition, this approach is capable of reducing any number of entries on each site to simply five key-value pairs, immensely reducing the amount of data shared and the amount of bandwidth utilized.
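The histogram exchange described in this use case can be sketched in a few lines of Python. The bin labels, the sample values and the function names below are assumptions made for illustration. The sketch also assumes decreases are recorded in whole points, and it places any value of 10 or below, including an increase, in the first bin.

```python
from collections import Counter
from typing import Iterable

# The five bins from the use case, as hypothetical labels.
BINS = ["<=10", "11-20", "21-30", "31-40", ">40"]


def bin_for(decrease: int) -> str:
    """Map a cholesterol decrease, in whole points, to its bin label."""
    if decrease <= 10:
        return BINS[0]
    if decrease <= 20:
        return BINS[1]
    if decrease <= 30:
        return BINS[2]
    if decrease <= 40:
        return BINS[3]
    return BINS[4]


def local_histogram(decreases: Iterable[int]) -> Counter:
    """Runs at a site: reduce per-patient values to five key-value pairs."""
    hist = Counter({label: 0 for label in BINS})
    for decrease in decreases:
        hist[bin_for(decrease)] += 1
    return hist


def global_histogram(histograms: Iterable[Counter]) -> Counter:
    """Runs at any location: add the shared histograms bin by bin."""
    merged = Counter({label: 0 for label in BINS})
    for hist in histograms:
        merged.update(hist)  # update() adds counts and keeps zero-count bins
    return merged


# Hypothetical per-patient decreases; these stay at their sites.
site_1 = local_histogram([5, 12, 18, 25, 41, 9])
site_2 = local_histogram([33, 38, 45, 10, 21])

merged = global_histogram([site_1, site_2])
print(dict(merged))
# {'<=10': 3, '11-20': 2, '21-30': 2, '31-40': 2, '>40': 2}
```

However many patients a site holds, what it shares is always the same five pairs, which is the bandwidth reduction the use case describes.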
The path to federated analytics
Federated analytics is the future for organizations that want to gain value from data that is inherently distributed. That we know for sure. But how do you get there? Here are three key steps in moving forward:
1. Gather a team of data scientists who can help you redesign your algorithms, especially deep learning ones, to work in a federated manner. These bright minds can help you see and approach your analytics in a new way.
2. Create a metadata layer to serve as the foundation for federated analytics. This layer makes data that is scattered around the world locatable, accessible and usable for analysis by data scientists. For more on this topic, see my recent blog titled Building a Global Meta-data Fabric to Accelerate Data Science.
3. Automate your compute framework via a World Wide Herd (WWH) — a concept that creates a global network of Apache™ Hadoop® instances that work together to function as a single virtual computing cluster. WWH orchestrates the execution of distributed and parallel computations on a global scale, pushing analytics to where the data resides. For a closer look at this concept, see another of my recent blogs titled Distributed Analytics Meets Distributed Data.
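To make steps 2 and 3 more concrete, here is a toy Python sketch of a metadata catalog that locates a dataset and a dispatcher that sends an analysis to each site holding it. Everything in it is hypothetical and runs in a single process: it is not WWH or Hadoop code, and `CATALOG`, `SITE_STORAGE`, `run_at_site` and `federated_run` are names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical metadata layer: dataset name -> sites that hold part of it.
CATALOG: Dict[str, List[str]] = {
    "cholesterol_trial": ["eu-site", "us-site"],
}

# Stand-in for data that stays at each site (in memory, for illustration only).
SITE_STORAGE: Dict[str, Dict[str, List[int]]] = {
    "eu-site": {"cholesterol_trial": [5, 12, 41]},
    "us-site": {"cholesterol_trial": [33, 10]},
}


def run_at_site(site: str, dataset: str,
                analysis: Callable[[List[int]], int]) -> int:
    """Pretend to execute `analysis` at `site`; only its result comes back."""
    return analysis(SITE_STORAGE[site][dataset])


def federated_run(dataset: str,
                  analysis: Callable[[List[int]], int]) -> Dict[str, int]:
    """Look up where the dataset lives, then push the analysis to each site."""
    sites = CATALOG[dataset]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: run_at_site(s, dataset, analysis), sites)
        return dict(zip(sites, results))


# Push a trivial analysis (counting records) to every site.
counts = federated_run("cholesterol_trial", len)
print(counts)                # {'eu-site': 3, 'us-site': 2}
print(sum(counts.values()))  # 5
```

In a real deployment the call to `run_at_site` would cross a network boundary and be subject to each site's security and compliance controls. The sketch shows only the control flow of locating the data and moving the computation to it.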
Here are the key takeaways: Data science as we know it is changing. Yesterday’s practices, based on centralized data repositories, won’t work in a time when data is inherently distributed and growing at exponential rates. We now need an all-new approach to data science.