How IT Supports the Data Science Operation
In its purest form, the data science world consists of parallel processing servers that primarily run Hadoop in batch mode, large troves of data for those servers to operate on, and statistically and scientifically trained data scientists who know nothing about IT or about the requirements of maintaining an IT operation.
While some organizations include data science specialties within IT and therefore have IT management and support expertise nearby, just as many companies run their data science departments independently of IT. These departments often have little grasp of the IT disciplines needed to maintain and support the health of a big data ecosystem.
This is why many organizations are discovering how critical it is for data science and IT to work hand in hand.
For CIOs and data center leaders, who by necessity should be heavily involved in an IT-data science partnership, what are the important bases that need to be covered to assure IT support of a data science operation?
Two or three years ago, it was a basic rule of thumb that Hadoop, the dominant big data/data science platform in companies, ran in batch mode. This made it easy for organizations to run big data applications on commodity computing hardware. Now, with the move to more real-time processing of big data, organizations are shifting from commodity hardware to in-memory processing, SSD storage and the Apache Spark cluster computing framework. This requires robust processing that can’t necessarily be performed by commodity servers. It also requires IT know-how for configuring hardware components for optimal processing. Because many IT departments are accustomed to a fixed-record, transactional computing environment, not all of them have resident skills for working with or fine-tuning in-memory parallel processing. This is a technical area that IT may need to cross-train or recruit for.
In the Hadoop world, MapReduce is the dominant programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster. Apache Spark processes in-memory, enabling real-time big data processing. Organizations are moving to more real-time processing, but they also understand the value that Hadoop delivers in a batch environment. From a software standpoint, IT must be able to support both platforms.
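To make the MapReduce model concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are assumptions chosen for illustration. It only mimics the three stages that a real cluster would distribute across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key before they are reduced."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on a list of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input records, standing in for lines of a large file.
records = ["big data batch", "big data real time", "batch"]
counts = run_job(records)
print(counts)
```

On a cluster, the map and reduce steps run in parallel on different machines, and the framework handles the shuffle between them. That distribution, not the logic itself, is what IT has to size and support.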
Most IT departments function with a hybrid computing infrastructure that consists of in-house systems and applications in the data center, coupled with private and public cloud systems. This has required IT to think outside of the data center, and to implement management policies, procedures and operations for systems, applications and data that may be in-house, in-cloud or both. Operationally, this means IT must continue to manage its internal technology assets in-house while also working with cloud vendors to which asset management has been outsourced, or working in the cloud itself when assets are merely hosted there and the enterprise continues to manage them.
Support for data science and big data in this more complicated infrastructure takes the IT technology management responsibility one step further, because the management goals for big data differ from those of traditional, fixed data.
Among the support issues for big data that IT must decide on are:
- How much big data, which is voluminous and constantly building, should be archived, and which data should be discarded?
- What are the storage and processing price points of cloud vendors, and at what point do cloud storage and processing become more expensive than their in-house equivalents?
- What is the disaster recovery plan for big data and its applications, which are becoming mission critical for organizations?
- Who is responsible for SLAs, especially in the cloud world, when a big data production problem occurs?
- How is data shuttled safely and securely between the cloud and the data center?
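The cost question in the list above comes down to simple arithmetic once the price points are known. The sketch below assumes a deliberately simplified model: cloud cost is a per-terabyte storage price plus a per-hour compute price, and in-house cost is a fixed monthly amount plus a marginal per-terabyte cost. Every function name and figure is hypothetical. Real vendor pricing has tiers, egress fees and discounts that this ignores.

```python
def monthly_cloud_cost(tb_stored, compute_hours, price_per_tb, price_per_hour):
    """Simplified cloud bill: storage charge plus compute charge."""
    return tb_stored * price_per_tb + compute_hours * price_per_hour


def monthly_in_house_cost(tb_stored, fixed_monthly, marginal_per_tb):
    """Simplified in-house cost: fixed overhead plus per-TB marginal cost."""
    return fixed_monthly + tb_stored * marginal_per_tb


def break_even_tb(fixed_monthly, marginal_per_tb,
                  price_per_tb, compute_hours, price_per_hour):
    """Storage volume (TB) above which cloud costs more than in-house.

    Returns None when the cloud per-TB price does not exceed the in-house
    marginal cost, because growing volume then never tips the balance
    toward in-house.
    """
    slope = price_per_tb - marginal_per_tb
    if slope <= 0:
        return None
    compute_cost = compute_hours * price_per_hour
    return max(0.0, (fixed_monthly - compute_cost) / slope)


# Hypothetical monthly figures, for illustration only.
volume = break_even_tb(fixed_monthly=10_000, marginal_per_tb=5,
                       price_per_tb=25, compute_hours=500, price_per_hour=2)
print(f"Cloud becomes more expensive above {volume} TB")
```

With these made-up numbers, both options cost the same at 450 TB, and cloud is the more expensive choice beyond that volume.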
Data scientists have expertise in statistical analysis and algorithm development, but they don’t necessarily know how much or which data is available for them to operate on. This is an area where IT excels, because its organizational charter is to track all of the data in enterprise storage, as well as data that is incoming and outgoing.
Suppose a marketing manager wants to develop customer analytics that draw on facts stored internally in customer records and in customers’ purchasing and service histories with the company, and also wants to know what customers are interested in by tracking their activity on websites and social media. IT is the most knowledgeable when it comes to determining all of the paths to a total picture of customer information. And it’s the database group, working in tandem with other IT departments, that develops the joins that aggregate these data sets so the algorithms data scientists develop can operate on them and produce the most accurate results.
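As a rough illustration of the kind of join the database group might build, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names (`customers`, `purchases`, `web_visits`), columns and sample rows are all assumptions made up for this example. A production version would run against the enterprise's actual data stores.

```python
import sqlite3

# In-memory database with three hypothetical data sets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (customer_id INTEGER, amount REAL);
CREATE TABLE web_visits (customer_id INTEGER, page TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO purchases VALUES (1, 40.0), (1, 60.0), (2, 25.0);
INSERT INTO web_visits VALUES (1, '/pricing'), (2, '/support'), (2, '/pricing');
""")

# Aggregate each source per customer first, then join, so that rows from
# one source do not multiply rows from another.
query = """
SELECT c.customer_id,
       c.name,
       COALESCE(p.total_spent, 0) AS total_spent,
       COALESCE(w.visits, 0) AS visits
FROM customers AS c
LEFT JOIN (SELECT customer_id, SUM(amount) AS total_spent
           FROM purchases GROUP BY customer_id) AS p
       ON p.customer_id = c.customer_id
LEFT JOIN (SELECT customer_id, COUNT(*) AS visits
           FROM web_visits GROUP BY customer_id) AS w
       ON w.customer_id = c.customer_id
ORDER BY c.customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result is one row per customer that combines the internal record, purchase history and web activity, which is the aggregated view an algorithm would then consume.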
Without IT’s knowledge of where the data is and how to access and aggregate it, data scientists and analytics engineers would be challenged to arrive at accurate insights that can benefit the business.
IT support of the data science operation is a key pillar of corporate analytics success.
IT enables data scientists to do what they do best: design algorithms to mine the best information from data. At the same time, IT stays in its own wheelhouse: knowing where to find the data and how to aggregate it.
Mary E. Shacklett is an internationally recognized technology commentator and President of Transworld Data, a marketing and technology services firm. Prior to founding her own company, she was Vice President of Product Research and Software Development for Summit Information