Source – insidebigdata.com
As Big Data moves from hype to reality, more companies are adopting a “data first” approach. Data is used to formulate strategies, design products, embed intelligence in applications, and ultimately provide an awesome customer experience. With increased data awareness, executives are also demanding more from their data teams: mining data to derive insights is no longer sufficient on its own. The emphasis is now on fast analytics, which provides the agility and flexibility to react to data in real time. Analytics pipelines are also expected to process data from a growing number of sources and in changing formats.
With Big Data teams facing these challenges, the demand for skilled Big Data professionals is expected to rise. While “data science” was famously touted as the hottest skill in demand, it is now clear that there is an equally large demand for the skills needed to get the data to the data scientists.
In this article, we detail the four categories of skills that are critical for professionals in the field of Big Data. These categories align with the layers of a Big Data solution: building the infrastructure (Administer), developing data pipelines to transport data (Develop), analyzing the data to extract meaningful insights (Analyze), and communicating the results to stakeholders and executives (Visualize).
The first category, Administer, concerns the physical infrastructure (servers, networks, clusters, and storage) that forms the backbone of a Big Data implementation. This infrastructure needs to be carefully planned and configured to withstand the demands of data storage and processing. DevOps for Big Data is a key administration skill and often requires knowledge of cloud architectures, as Big Data processing is well suited to the elastic and highly available nature of the cloud. Hadoop’s HDFS (Hadoop Distributed File System) and NoSQL databases are widely seen as reliable ways to store large volumes of data on clusters of inexpensive servers, which can scale to thousands of nodes. Knowledge of administering Hadoop and NoSQL databases is therefore crucial in this role.
The second category, Develop, deals with the skills essential for building data pipelines and frameworks that ultimately provide data for the data scientists to analyze. Data pipelines are responsible for ingesting data from a variety of sources, both real-time and batch. Hadoop and Spark are among the most popular Big Data frameworks used across organizations. While Hadoop is the popular choice for storing Big Data, Spark has stepped in to provide a lightning-fast framework for processing it.
Spark can replace Hadoop’s MapReduce as the processing engine and provides a unified data processing platform. Spark’s modules, which include Spark SQL, MLlib, Spark Streaming, and GraphX, can be combined to develop entire data pipelines. These unified data pipelines are capable of ingesting streaming data, querying and manipulating data, and then applying machine learning algorithms and graph processing techniques to yield intelligent insights. A Big Data developer needs programming experience in languages like Python, Java, and Scala. Cloud experience is also valuable, since services like Amazon Kinesis and AWS Lambda offer alternatives for real-time processing in a microservice-based architecture.
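The stages described above (ingest, query and manipulate, then apply an analytical step) can be sketched in a few lines of plain Python. This is only an illustrative sketch and does not use Spark itself: the function names, the record fields ("user", "amount"), and the simple z-score check standing in for a machine learning step are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def ingest(batch_records, stream_records):
    """Ingest step: merge a batch source and a (simulated) streaming source."""
    yield from batch_records
    yield from stream_records


def clean(records):
    """Query/manipulate step: drop malformed rows and normalise types."""
    for rec in records:
        if rec.get("user") and rec.get("amount") is not None:
            yield {"user": rec["user"], "amount": float(rec["amount"])}


def total_by_user(records):
    """Aggregate step: sum the amount per user."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["user"]] += rec["amount"]
    return dict(totals)


def flag_outliers(totals, threshold=1.5):
    """Hypothetical stand-in for an ML step: flag totals far above the mean."""
    values = list(totals.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(u for u, v in totals.items() if (v - mu) / sigma > threshold)


def run_pipeline(batch_records, stream_records):
    """Chain the stages into one pipeline and return (totals, outliers)."""
    totals = total_by_user(clean(ingest(batch_records, stream_records)))
    return totals, flag_outliers(totals)
```

In a real deployment each stage would typically map onto a framework component (for example, a streaming source for ingestion and a library model for the analytical step); the point of the sketch is only that the stages compose into a single pipeline.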
The third category, Analyze, encompasses a mix of expertise in statistics, machine learning, data mining, operations research, mathematics, and computer programming. Data scientists, data analysts, machine learning engineers, and artificial intelligence engineers are required to dive into petabytes of messy data, build algorithms, and then write robust code to automate those algorithms and prove their performance on large-scale live data. In addition to familiarity with the Hadoop ecosystem, these roles also demand experience with programming languages like R, Java, Scala, and Python.
The fourth category, Visualize, is about communicating results. Visualization developers tell a story using the data collected and design dashboard visualizations tailored to customer needs. They serve as a technical resource for accessing disparate sources of data and integrating those sources into a common, interactive platform that effectively displays how the company is performing against its Key Performance Indicators (KPIs).
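As a rough illustration of integrating disparate sources and comparing the result with KPI targets, the following sketch builds the rows a dashboard might display. Everything here is hypothetical: the function names, the metric names, and the assumption that every KPI is a "higher is better" figure that can be summed across sources.

```python
def merge_sources(*sources):
    """Combine per-metric figures from several sources by summing them."""
    combined = {}
    for source in sources:
        for metric, value in source.items():
            combined[metric] = combined.get(metric, 0) + value
    return combined


def kpi_status(actuals, targets):
    """Return one dashboard row per KPI: actual, target, and whether it is met."""
    rows = []
    for name, target in sorted(targets.items()):
        actual = actuals.get(name, 0)
        rows.append(
            {"kpi": name, "actual": actual, "target": target, "met": actual >= target}
        )
    return rows
```

A visualization tool would then render these rows as charts or status indicators; the sketch covers only the data preparation behind them.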
These are the four top skill sets in the Big Data profession, but it is not uncommon to see roles where the lines between these categories are blurred. For example, a full stack data scientist is responsible for working across the entire data stack. Depending on the size and needs of the organization, be prepared to work across these lines.