Three Steps to Successful Collaboration with Data Scientists
Source – eos.org
The vast and rapidly increasing supply of new data in the Earth sciences creates many opportunities to gain scientific insights and to answer important questions. Data analysis has always been an integral component of research and education in the Earth sciences, but mainstream Earth scientists may not yet be fully aware of many recently developed methods in computer science, statistics, and math.
The fastest way to put these new methods of data analysis to use in the Earth sciences is for Earth scientists and data scientists to collaborate. However, those collaborations can be difficult to initiate and even more difficult to maintain and to guide to successful outcomes. Here we break down the collaboration process into steps and provide some guidelines that we have found useful for efficient collaboration between Earth scientists and data scientists. We base our structure on discussions with many researchers working in similar areas and on our own experience, gained from more than 6 years of collaboration on related topics.
Knowledge Discovery from Data
The data analysis methods we are concerned with are those that seek to identify new knowledge: discovering patterns, revealing interactions between different processes, or yielding other types of insights that can be interpreted by Earth scientists and eventually attributed to some physical effect. We refer to these types of methods as knowledge discovery from data.
Some of these new methods come from the fields of deep learning (using artificial neural networks), causal discovery (using probabilistic graphical models that describe cause-and-effect relationships), and self-organizing maps. Artificial neural networks, for example, have been used to predict air quality and the occurrence of severe weather and to derive nonlinear transfer functions that convert observations to important geophysical parameters. Causal discovery has been used to identify information flow (the pathways between cause and effect) in the atmosphere around the globe. Self-organizing maps are becoming a preferred tool to classify recurring atmospheric flow features such as jet streams.
Meet the Scientists
How do Earth scientists and data scientists get acquainted and begin to work together? Meet Peter and Andrea, our two companions in this article (Figure 1). Peter is an Earth scientist. He studies important geoscience questions, often based on data from observations and computer model simulations. Andrea is a data scientist. She studies the newest data analysis methods developed in statistics, data mining, and machine learning. Let’s follow Peter and Andrea as they meet and move through the three major phases of their collaboration experience.
How did Peter and Andrea find each other? They might have run into each other on campus. Maybe one of them attended a talk given by the other. Or maybe a common colleague connected them. If they were actively looking for such collaboration, they might have met at an activity designed to establish new collaborations between Earth and data scientists, such as the annual Climate Informatics workshop or the Intelligent Systems for Geosciences (IS-GEO) Research Collaboration Network.
Once they met, they briefly talked one on one about the methods for data analysis that Andrea is using and about science questions that Peter is interested in. A 15-minute in-person meeting may have been all that was needed for them to discover some common interest and to set up a longer meeting to discuss potential collaboration.
Would it have been better for Andrea to just read papers from the Earth sciences to identify science questions that might be a good match and then to contact the authors for potential collaboration? It is possible that she could find a good collaborator that way. However, unless she already has a solid background in the Earth sciences, Andrea’s chances of success would be small because identifying suitable problems to work on is so complex.
Research Phase 1: Defining the Research Problem and Approach
Peter and Andrea begin their research collaboration by defining a problem and choosing an approach to solving it (Figure 1, top right). This first phase is an iterative process that must take into account many different and inherently coupled aspects.
On the Earth science side, this task requires knowledge of which science questions are important and not yet fully understood and knowledge of available data sets. It also requires a deep understanding of the physical processes and interactions being investigated, the temporal and spatial scales at which these interactions take place, and most importantly, intuition of what aspect of a science question might benefit significantly from “mining” large amounts of data with innovative approaches.
On the data science side, this task requires a solid understanding of available data analysis methods and what insights can realistically be gained from them. The task also requires knowing the associated data requirements (minimal sample size and distribution assumptions) and computational effort, as well as common pitfalls and how to avoid them.
The aspects from both sides are inherently coupled: For example, to figure out which algorithm to use, our collaborators first need to understand the properties of the available data and the types of insights they want to gain. Thus, neither Peter nor Andrea can define the research problem in isolation. To define a feasible and meaningful research project, they must work closely together and have frequent conversations. They need to be open-minded and willing to learn the basic vocabulary and way of thinking of each other’s disciplines.
Research Phase 2: Conducting Experiments on the Data
In the second phase, the researchers conduct experiments on the data, such as trying different data analysis methods (Figure 1, middle right). This step sounds like a job mainly for Andrea, but if Andrea works in isolation, there is a good chance that she will take many unnecessary detours that might even cause her to get lost and give up on the project altogether. Only Peter knows what kind of preprocessing or other modifications might help to expose the signals or patterns in the data that they seek to discover.
Therefore, this step also requires constant communication between Peter and Andrea. Every time Andrea tries a new approach, Peter needs to look closely at the results and suggest changes that may get the team closer to useful results: additional preprocessing of the data, a different spatial or temporal resolution, a focus on a specific geographical area, or a rephrasing of the scientific question.
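To make the kind of preprocessing Peter might suggest more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not a method taken from any particular study: the function names (monthly_anomalies, coarsen) and the two-year monthly series are hypothetical. The first function removes the average seasonal cycle so that weaker signals are easier to see, and the second averages the result to a coarser temporal resolution.

```python
from statistics import mean


def monthly_anomalies(values, months):
    """Subtract each calendar month's long-term mean from the values."""
    by_month = {}
    for value, month in zip(values, months):
        by_month.setdefault(month, []).append(value)
    climatology = {month: mean(vals) for month, vals in by_month.items()}
    return [value - climatology[month] for value, month in zip(values, months)]


def coarsen(values, window):
    """Average consecutive, non-overlapping blocks of `window` samples."""
    n_blocks = len(values) // window
    return [mean(values[i * window:(i + 1) * window]) for i in range(n_blocks)]


# Hypothetical two-year monthly series: a seasonal cycle plus a small trend.
months = [m for _ in range(2) for m in range(1, 13)]
seasonal_cycle = [10, 12, 15, 18, 22, 25, 27, 26, 22, 18, 14, 11]
values = [seasonal_cycle[m - 1] + 0.1 * i for i, m in enumerate(months)]

anomalies = monthly_anomalies(values, months)   # seasonal cycle removed
seasonal_means = coarsen(anomalies, 3)          # monthly -> 3-month averages
```

In this toy series the large seasonal swing disappears and only the small trend remains in the anomalies. Which of these steps is appropriate for real data, and at what resolution, is exactly the kind of judgment that requires Peter's domain knowledge.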
Research Phase 3: Evaluation and Interpretation
Once Peter and Andrea obtain promising results, they need to evaluate them (Figure 1, bottom right). Do the results represent a real physical phenomenon, or are they merely an unforeseen by-product of the data collection or analysis method? Economist and Nobel laureate Ronald Coase once said, “if you torture the data long enough it will confess.”
Thus, before presenting the results as facts of the actual physical processes they studied, Peter and Andrea need to verify that the patterns are, indeed, properties of the underlying system, not just artifacts of the specific data set and analysis method. Ultimately, Peter needs to check whether the results are robust, make physical sense, and can be explained by known or hypothesized interactions in the considered Earth system.
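One simple check that Andrea might run at this stage is a permutation test, which asks how often a relationship as strong as the one observed would appear if the two variables were unrelated. The sketch below is a hedged illustration rather than a recommended procedure: the function names and the synthetic data are our own assumptions, and plain shuffling ignores the autocorrelation that is common in geophysical time series, so real applications would likely need a block-based or otherwise adapted variant. A small p value supports statistical robustness only; whether the pattern makes physical sense is still Peter's call.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Fraction of shuffles whose |correlation| matches or exceeds the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson(x, shuffled)) >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Synthetic example data (hypothetical): y_linked depends on x, y_noise does not.
rng = random.Random(42)
x = [float(i) for i in range(30)]
y_linked = [xi + rng.gauss(0.0, 1.0) for xi in x]
y_noise = [rng.gauss(0.0, 1.0) for _ in x]

p_linked = permutation_p_value(x, y_linked)
p_noise = permutation_p_value(x, y_noise)
```

Here p_linked should come out very small, whereas p_noise can fall anywhere between 0 and 1 because there is no real relationship to detect.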
What Does It All Mean?
The last tasks of phase 3 lead us to the final step of the project, namely, to fully understand what the results mean in the context of the science question they set out to address. Do Peter and Andrea’s results answer the original question they asked? What exactly did they learn?
Only if Peter and Andrea take the time to translate the research results back into the real world of physics, dynamics, chemistry, and geosciences and spell out all the implications for the considered Earth system will anyone in the Earth science community care about the results. Peter obviously plays the bigger role in this last step, but he still needs continuous feedback from Andrea to help him interpret the results correctly because only she knows about weaknesses or limitations of the method.
Peter and Andrea learned that they have to work together very closely at every step of their joint project because all their decisions require a deep understanding of both Earth science and data analysis disciplines. Each of them had to be curious about the other’s discipline and also be willing to teach some basic skills or knowledge of their own discipline to the other person. Through this process, they each gained at least some very basic understanding of the nature of the other’s discipline, including its way of thinking, relevant concepts, and terminology.
Working closely together and learning about the other’s field not only made Peter and Andrea’s current collaboration run much more smoothly than it otherwise would have; it also created ideas for future projects. Through close collaboration, Peter and Andrea learned about each other’s fields even as they were contributing to them.