data engineering Archives - Artificial Intelligence

10 MUST-HAVE SKILLS FOR DATA ENGINEERING JOBS

aiuniverse — Tue, 08 Jun 2021 06:27:52 +0000

Source – https://www.analyticsinsight.net/

Big data skills are crucial for landing data engineering roles. From designing, building, and maintaining data pipelines to collating raw data from various sources and optimizing performance, data engineering professionals carry out a wide range of tasks. They are expected to know big data frameworks, databases, data infrastructure, containers, and more. It is also important that they have hands-on exposure to tools such as Scala, Hadoop, HPCC, Storm, Cloudera, RapidMiner, SPSS, SAS, Excel, R, Python, Docker, Kubernetes, MapReduce, and Pig, to name a few.

Here, we list some of the important skills that one should possess to build a successful career in big data.

1. Database Tools

Storing, organizing, and managing huge data volumes is critical in data engineering roles, so a deep understanding of database design and architecture is essential. The two commonly used types of databases are those based on Structured Query Language (SQL) and NoSQL databases. SQL-based systems such as MySQL and Oracle Database (with its PL/SQL procedural extension) store structured data, while NoSQL technologies such as Cassandra and MongoDB can store large volumes of structured, semi-structured, and unstructured data, as the application requires.
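
To make the distinction concrete, here is a minimal sketch using only the Python standard library. It assumes an in-memory SQLite table as a stand-in for a SQL database and plain JSON documents as a stand-in for a document store; the table, field names, and sample values are made up for illustration.

```python
import json
import sqlite3

# Structured data: every row must fit the schema declared up front.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.executemany(
    "INSERT INTO users (id, name, age) VALUES (?, ?, ?)",
    [(1, "Asha", 31), (2, "Ravi", 27)],
)
older = conn.execute(
    "SELECT name FROM users WHERE age > ? ORDER BY name", (30,)
).fetchall()

# Semi-structured data: each document may carry different fields.
documents = [
    json.loads('{"id": 1, "name": "Asha", "tags": ["admin"]}'),
    json.loads('{"id": 2, "name": "Ravi", "address": {"city": "Pune"}}'),
]
cities = [doc["address"]["city"] for doc in documents if "address" in doc]

print(older)   # rows matching the SQL filter
print(cities)  # values pulled from documents that happen to have an address
```

The SQL side rejects rows that do not match the schema, while the document side leaves it to the reader of the data to cope with missing fields.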

2. Data Transformation Tools

Big data arrives in raw form and cannot be used directly. It must first be converted into a consumable format suited to the use case. Data transformation can be simple or complex, depending on the data sources, formats, and required output. Data transformation tools include Hevo Data, Matillion, Talend, Pentaho Data Integration, InfoSphere DataStage, and more.
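
As a rough illustration of what a simple transformation looks like, the sketch below converts one raw record into typed, normalized fields. The field names, date format, and sample values are assumptions chosen for the example, not the behavior of any of the tools listed above.

```python
from datetime import datetime

def transform(row):
    """Convert one raw, string-typed record into a consumable record."""
    return {
        # Strip stray whitespace and cast the identifier to an integer.
        "user_id": int(row["user_id"].strip()),
        # Normalize a day/month/year date to ISO 8601.
        "signup_date": datetime.strptime(row["signup"], "%d/%m/%Y").date().isoformat(),
        # Remove thousands separators before parsing the amount.
        "amount": float(row["amount"].replace(",", "")),
    }

raw_row = {"user_id": " 42 ", "signup": "08/06/2021", "amount": "1,250.50"}
clean_row = transform(raw_row)
print(clean_row)
```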

3. Data Ingestion Tools

Data ingestion, an essential big data skill, is the process of moving data from one or more sources to a destination where it can be analyzed. As the volume and variety of data formats grow, ingestion becomes more complex, and professionals need to know ingestion tools and APIs in order to prioritize data sources, validate them, and dispatch data effectively. Data ingestion tools to know include Apache Kafka, Apache Storm, Apache Flume, Apache Sqoop, Wavefront, and more.

4. Data Mining Tools

Another important skill for handling big data is data mining, which involves extracting vital information to find patterns in large data sets and preparing them for analysis. Data mining supports data classification and prediction. Data mining tools that big data professionals should have hands-on experience with include Apache Mahout, KNIME, RapidMiner, Weka, and more.

5. Data Warehousing and ETL Tools

Data warehousing and ETL help companies leverage big data in a meaningful way by streamlining data that comes from heterogeneous sources. ETL, or Extract, Transform, Load, takes data from multiple sources, converts it for analysis, and loads it into the warehouse. Popular ETL tools include Talend, Informatica PowerCenter, AWS Glue, Stitch, and more.
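
The three ETL stages can be sketched in a few lines of standard-library Python. This is a toy pipeline under stated assumptions: the CSV text stands in for a source system, an in-memory SQLite database stands in for the warehouse, and the `orders` table and its columns are invented for the example.

```python
import csv
import io
import sqlite3

SOURCE_CSV = "order_id,country,total\n1,in,100.0\n2,us,250.5\n3,in,40.0\n"

def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and normalize country codes before loading."""
    return [(int(r["order_id"]), r["country"].upper(), float(r["total"])) for r in rows]

def load(conn, records):
    """Load: write the transformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, country TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
totals = warehouse.execute(
    "SELECT country, SUM(total) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

The point of the ordering is that only cleaned, typed data ever reaches the warehouse table.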

6. Real-time Processing Frameworks

Processing data as it is generated is essential for producing quick insights to act upon. Apache Spark is the most popular distributed framework for real-time data processing. Other frameworks to know include Hadoop (which is batch-oriented), Apache Storm, Flink, and more.

7. Data Buffering Tools

With increasing data volumes, data buffering has become crucial for speeding up data processing. Essentially, a data buffer is an area that temporarily stores data while it moves from one place to another. Buffering becomes important when streaming data is generated continuously from thousands of sources. Commonly used tools for data buffering are Kinesis, Redis Cache, GCP Pub/Sub, etc.
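
The idea of a buffer can be shown with a small in-process sketch. It assumes a hypothetical `BatchBuffer` class that collects events in memory and hands them downstream in fixed-size batches; real services such as the ones named above add durability, partitioning, and delivery guarantees that this toy leaves out.

```python
from collections import deque

class BatchBuffer:
    """Hold incoming events temporarily and release them in batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._items = deque()
        self.flushed = []  # stands in for the downstream consumer

    def put(self, event):
        self._items.append(event)
        # Release a batch as soon as enough events have accumulated.
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._items:
            self.flushed.append(list(self._items))
            self._items.clear()

buffer = BatchBuffer(batch_size=3)
for event_id in range(7):
    buffer.put(event_id)
buffer.flush()  # drain whatever is left at the end of the stream
print(buffer.flushed)
```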

8. Machine Learning Skills

Integrating machine learning into big data processing can accelerate it by uncovering trends and patterns. Machine learning algorithms can categorize incoming data, recognize patterns, and translate data into insights. Understanding machine learning requires a strong foundation in mathematics and statistics. Knowledge of tools such as SAS, SPSS, R, etc. can help in developing these skills.

9. Cloud Computing Tools

Setting up cloud infrastructure to store data and ensure its high availability is one of the key tasks of big data teams, which makes it an essential skill for anyone working with big data. Companies work with hybrid, public, or in-house cloud infrastructure depending on their data storage requirements. Popular cloud platforms to know include AWS, Azure, GCP, OpenStack, OpenShift, and more.

10. Data Visualization Skills

Big data professionals work with visualization tools day in and day out. They need to present the insights and learnings they generate in a format that end users can consume. Popular visualization tools worth learning include Tableau, Qlik, Tibco Spotfire, Plotly, and more.

The best way to learn these data engineering skills is to get certifications and get hands-on practice by exploring new data sets and integrating them into real-life use cases. Good luck learning them!

The post 10 MUST-HAVE SKILLS FOR DATA ENGINEERING JOBS appeared first on Artificial Intelligence.

DATA ENGINEERING SKILLS AND DEEP LEARNING: MOST PREFERRED SKILL SETS

aiuniverse — Fri, 12 Mar 2021 08:55:05 +0000

Source – https://www.analyticsinsight.net/

Data engineering skills and deep learning are growing in demand

With a staggering 2.5 quintillion bytes of data produced every day, data scientists are busier than ever. The more data we have, the more we can do with it, and data science gives us strategies for using this data effectively. It makes sense, then, that software engineering has grown to include data engineering, a subdiscipline that focuses on the transport, transformation, and storage of data.

Data engineering is a subset of data science, an umbrella term that covers numerous fields of knowledge related to working with data. Fundamentally, data science is about obtaining data for analysis in order to deliver meaningful and valuable insights. The data can then be applied to provide value for machine learning, BI, data stream analysis, or any other type of analytics.

Technologies like Artificial Intelligence, Machine Learning, Deep Learning, and Data Science are surrounded by hype these days. Yet these terms are also tossed about as buzzwords, and many people don’t have a clue what they truly mean or what skills are needed to master them. When it comes to careers, many individuals are focusing on data science with a deep learning specialization.

Deep Learning is a subset of Artificial Intelligence: a machine learning technique that teaches devices and computers to perform logical functions. Deep learning gets its name from the fact that it involves going deep into the numerous layers of a network, including its hidden layers. The deeper you go, the more intricate the insights you extract.

Deep learning neural networks rely on various complex programs to imitate human intelligence.

According to a report by Udacity, deep learning and data engineering are the top Nanodegree programs, reflecting India’s growing interest in AI and data. While deep learning for computer vision is driving advances in artificial intelligence that are changing our world, data engineering skills are the backbone of the new world of Big Data.

Whether it is technology-assisted parking or face recognition at the airport, deep learning is fuelling a great deal of automation today. Nevertheless, deep learning’s importance is best explained by the fact that our world is generating enormous amounts of data, which needs structuring on a huge scale. Deep learning neural networks make the best use of this growing volume of data. All the information gathered from this data is used to achieve accurate results through iterative learning models.

While data science, and data scientists in particular, are concerned with exploring data, discovering insights in it, and building machine learning algorithms, a person with a data engineering skill set cares about making these algorithms work on production infrastructure and about building data pipelines in general. Thus, the skills required for data engineering center on building and managing the technological infrastructure of a data platform.

There has always been immense interest among students in upskilling themselves on the tech front. In the wake of the pandemic, rapid digital adoption has made advanced courses even more attractive worldwide to forward-looking professionals. The findings reaffirm that demand for technology-oriented jobs is growing steadily, and many regions have also been able to use technology to upskill for cutting-edge job profiles.

Candidates who wish to begin a deep learning specialization should make sure that their mathematical and programming skills are in place. Since deep learning falls under artificial intelligence, knowledge of the broader concepts of the domain is always preferred.

Further, the skills of any professional relate to the duties they are responsible for. The range of skills varies, as data engineers draw on a wide range of key skills. For the most part, however, their tasks fall into three main areas: engineering, data science, and databases/warehouses.

Deep Learning & Data Engineering Are Top In-demand Skills Of 2020 In India: Reports

aiuniverse — Wed, 10 Mar 2021 09:05:38 +0000

Source – http://bweducation.businessworld.in/

While deep learning is driving advances in artificial intelligence that are changing our world, data engineering is the foundation for the new world of Big Data.

Udacity, a Silicon Valley-based global lifelong learning platform, has released the list of the most popular Nanodegree programs in India in 2020. The data is based on the number of enrollments during the year, showing the demand across different states and union territories.

Deep learning and data engineering are the top Nanodegree programs, showing the country’s growing interest in AI and data.

There is no doubt about the fact that COVID-19 has changed the global job landscape. This can be clearly seen in how the demand for some of the Nanodegree programs shifted in 2020 vis-a-vis 2019. Take a look:

Most Popular Nanodegree programs in 2020	Most Popular Nanodegree programs in 2019
Deep learning	AI programming with Python
Data Engineering	Machine Learning Engineer
Product Manager	Deep Learning
Machine Learning Engineer	Data Analyst
Data Scientist	AI Product Manager

There has always been huge interest amongst learners in upskilling themselves on the tech front. In the wake of the pandemic, rapid digital adoption has made advanced courses even more desirable worldwide for forward-looking professionals. The findings reaffirm that the demand for technology-oriented jobs is continually growing, and many regions have also been able to leverage technology to upskill for futuristic job profiles.

Here are some key findings:

Deep learning is big in Karnataka

Karnataka holds the lion’s share of Nanodegree program enrollments. As much as 24% of the demand for the Deep Learning Nanodegree program and 34% of the total demand for the Data Engineering Nanodegree program come from this state. Its share of the demand for AI Product Manager (38%) and Product Manager (60%) is also the highest of any state.

Maharashtra digs on data

Data Science and Deep Learning are the most popular Nanodegree programs in Maharashtra. The Product Manager Nanodegree program is also popular there, with more than 40% of enrollments coming from this state. Maharashtra is also a popular destination for the Data Science and Machine Learning Engineer Nanodegree programs, accounting for 25% and 21% of enrollments respectively.

Delhi is all about data & programming

The Indian capital, Delhi, is a frontrunner in mainstream programming languages. It accounted for 22% and 23% of the demand for the C++ and Full Stack Web Developer Nanodegree programs, respectively. The Data Analyst Nanodegree program is also big in the region, with 21% of enrollments coming from the National Capital Region.

Telangana produces more developers

The state of Telangana raked in 16% of enrollments for the Full Stack Web Development and C++ Nanodegree programs. The Machine Learning Nanodegree program is also big in the state, which accounts for around 11% of enrollments.

More than half of self-driving car engineers come from Tamil Nadu

Tamil Nadu produced more than 50% of India’s self-driving car engineers in 2020. Computer Vision and Machine Learning Nanodegree programs are also popular in the state.

FACTORS THAT AFFECT DATA SCIENCE PROJECTS AND HOW TO AVOID THEM

aiuniverse — Wed, 07 Oct 2020 06:31:46 +0000

Source: analyticsinsight.net

Why are a few organizations so effective with data science and AI while others struggle to deliver the expected ROI? Why do some organizations seem to adopt data science with ease while others develop solutions that either never make it to market or are seldom used once deployed? Why is it that, instead of data science, only analytics and reporting are delivered?

We’re sure you have faced these questions too! According to PwC’s global study, AI will be responsible for a 26% lift in GDP for local economies by 2030. However, for some organizations, deploying data science across different business functions can be difficult, if not daunting.

According to Gartner expert Nick Heudecker, over 85% of data science projects fail. Isn’t that huge? A report from Dimensional Research showed that only 4% of organizations have successfully deployed ML models. That is far too few!

Even more critically, the economic decline brought about by the COVID-19 pandemic has put increased pressure on data science and BI teams to deliver more with less. In this down market, companies are rethinking which AI/ML models they should build, how to optimize resources, and how best to spend significant budget dollars for the desired impact. In this kind of environment, AI/ML project failure is simply not acceptable.

Hiring one or more data scientists isn’t the first step in building data science capabilities. It’s a complex buildout that begins with a comprehensive plan. That is the difference between organizations seeing success and those encountering poor or inconsistent results.

As with any complex capability to build out, certain factors determine the success of a data science project.

Deciding Goals in Data Science

Defining goals and expectations before you even begin planning greatly affects the result, from both a technical and a business perspective.

A challenging aspect of this process is reconciling business objectives with mathematical objectives. On the one hand, business decisions are still shaped by vision and business intuition; on the other, data scientists make choices based on rigorous performance metrics that don’t always translate directly into business value.

A typical mistake companies make is to come up with a quantitative goal, for example 95% precision for a classification model, with no prior idea of whether this is a sensible target: Is precision the right performance metric? Is 95% too easy or too hard in this particular domain? Will achieving that level of precision translate into a competitive advantage for the business?

The only fix for this is to invest time in understanding the problem domain before defining goals and setting expectations. The benefits of having clearly defined goals flow down to all subsequent phases of the process, setting guidelines for both project managers and the technical team.
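
A small worked example shows why a single target such as 95% precision can mislead. The numbers below are invented for illustration: a hypothetical classifier sees 1,000 cases, 100 of them truly positive, and flags only 20, of which 19 are correct.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# 100 positives followed by 900 negatives (made-up data).
y_true = [1] * 100 + [0] * 900
# The model catches 19 positives, misses 81, and raises 1 false alarm.
y_pred = [1] * 19 + [0] * 81 + [1] * 1 + [0] * 899

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)           # share of flagged cases that are correct
recall = tp / (tp + fn)              # share of true positives that were found
accuracy = (tp + tn) / len(y_true)   # share of all cases labeled correctly

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.3f}")
```

Here the model meets the 95% precision goal while finding fewer than one in five of the positive cases, which may or may not be acceptable depending on the business problem.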

Lack of Skills

For almost two years, there has been a widespread talent shortage in the data science domain. LinkedIn reported in 2018 that there was a shortage of more than 150,000 people with data science skills. While the complex interdisciplinary nature of data science projects calls for subject matter experts such as mathematicians, data engineers, and many others, data scientists are often the most critical and also the most difficult to recruit. This means that organizations are struggling to implement and scale their projects, which in turn slows time to production. Also, many organizations can’t afford the large teams needed to run multiple projects simultaneously.

C-Suite Level Sponsorship

The most important realization for an organization using or planning to use data science is that it will change how you do business. This means that most teams and operations will be affected. Data science isn’t something an individual or a single team can make successful. The business as a whole needs to buy in and support the initiatives.

A clear vision defined by members of the C-Suite helps express that idea effectively. When you see a data science initiative that has only C-Suite visibility, it’s probably going to fail. Visibility isn’t sufficient to drive the kind of change needed to build data science capabilities.

Data science will surely fail in a siloed organization. CxOs and board members are the only individuals with the power to make enterprise-wide changes. Despite the fact that execution happens at lower levels, buy-in begins at the top.

How to Avoid Such Pitfalls

Data Engineering

You can’t do data science without data, specifically “good” data that feeds the data model or algorithm you are using to gain insights or make forecasts. Enterprises frequently underestimate the effort required to get data into a structure that is useful for analysis.

First, your product team needs to figure out what data you really need, which requires developing a data strategy that guides your product and business strategy. Second, make sure that your team scopes the data engineering effort appropriately; it is always underestimated when kickstarting a data science initiative. Lastly, it is important to understand that a data engineer has a different skill set than a data scientist. You need the former, not the latter, to properly understand the engineering required.

Automation

To help mitigate the factors that cause data science projects to fall flat, the industry has seen increased interest among companies in adopting end-to-end automation of the full data science process.

By automating data science, organizations are able not just to fail faster (which is a good thing in the case of data science), but also to improve their transparency efforts, ensure minimum value pipelines (MVPs), and continuously improve through iteration.

You may be wondering why failing fast is a positive. Strange as it may seem, failing fast can offer a huge advantage. Data science automation allows technical and business teams to test hypotheses and complete the whole data science workflow in days. Traditionally, this process is lengthy, taking months, and very costly. Automation allows failing hypotheses to be tested and eliminated faster. Fast failure of poor projects yields savings as well as increased productivity. This rapid try-fail-repeat process also allows organizations to find useful hypotheses in a more timely way.

DATA ENGINEERING ALWAYS TO THE RESCUE

aiuniverse — Mon, 18 May 2020 06:26:16 +0000

Source: analyticsinsight.net

In today’s highly competitive digital world, organizations must be data-driven to win. Data has become the fuel that lets organizations make precise business decisions at lightning speed. Data-driven organizations are not only able to provide a better, more targeted customer experience, but can also understand and act on new opportunities or threats ahead of the competition. It is no surprise, then, that numerous CEOs have signed off on huge, costly digital transformation projects in a bid to transform their conventional organizations into data-driven marvels.

However, becoming data-driven requires more than an eagerness to adopt and incorporate new analytics technologies like machine learning (ML). A recent report by Gartner noted that “in spite of enormous investments in data and analytics initiatives” nearly 50% of all companies surveyed reported “troubles in bringing them into production”. The truth of the matter is that, to be truly data-driven, data must sit at the center of the business. This requires not only data-driven processes and culture, but also a genuine understanding of the teams responsible for making the most of this data within the business.

With roots in statistical modeling and data analysis, data scientists have backgrounds in advanced math and statistics, advanced analytics, and increasingly machine learning / AI. The focus of data scientists is, obviously, data science: how to extract valuable information from an ocean of data, and how to translate business and scientific informational needs into the language of data and math. Data scientists need to be masters of the statistics, probability, mathematics, and algorithms that help glean valuable insights from huge piles of data.

These data scientists have, as a rule, learned programming out of necessity more than anything else, in order to run programs and perform advanced analysis on data. As a result, the code that data scientists are typically tasked with writing is minimal, only as much as is needed to accomplish a data science task (R is a common language for them to use), and they work best when they are given clean data to run advanced analytics on. A data scientist is a researcher who forms hypotheses, runs tests and analyses on the data, and then translates the results so that others in the company can easily view and understand them.

However, data scientists can’t do their jobs without access to large volumes of clean data. Extracting, cleaning, and moving data isn’t really the role of a data scientist, but rather that of a data engineer. Data engineers have programming and technology expertise and have previously been involved in data integration, middleware, analytics, business data portals, and extract-transform-load (ETL) operations. The data engineer’s center of gravity and skills lie in big data and distributed systems, along with experience in programming languages such as Java, Python, and Scala, and in scripting tools and techniques.

Data engineers are challenged with taking data from a wide range of systems, in structured and unstructured formats, that is usually not “clean”: it has missing fields, mismatched data types, and other data-related issues. They need to use their programming, integration, architecture, and systems skills to clean all the data and put it into a format and system that data scientists can then use to analyze it, build their data models, and deliver value to the organization. Thus, a data engineer is an engineer who designs, builds, and organizes data.
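
The kind of cleanup described here can be sketched briefly. The example assumes made-up records with an `id`, `age`, and `email` field, and an arbitrary policy: records without a usable `id` are rejected, while unparseable or missing values in other fields become `None`.

```python
RAW_RECORDS = [
    {"id": "1", "age": "34", "email": "a@example.com"},
    {"id": 2, "age": None, "email": "b@example.com"},   # missing age
    {"id": "3", "age": "unknown"},                      # bad type, missing email
    {"age": "51", "email": "d@example.com"},            # missing id
]

def to_int(value):
    """Return value as an int, or None when it is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def clean(record):
    """Normalize one record; return None if it has no usable identifier."""
    record_id = to_int(record.get("id"))
    if record_id is None:
        return None
    return {
        "id": record_id,
        "age": to_int(record.get("age")),
        "email": record.get("email"),
    }

cleaned = [row for row in map(clean, RAW_RECORDS) if row is not None]
rejected = len(RAW_RECORDS) - len(cleaned)
print(cleaned, rejected)
```

Whether to drop, default, or flag a bad value is a design decision that depends on how the downstream analysis treats missing data.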

Data engineers are becoming familiar with analytics so they can build better pipelines. Analysts are learning increasingly sophisticated data science techniques to deliver better insights. Data scientists are joining engineering teams to integrate AI into actual products and services. Yet for a leader helping these interdisciplinary people define their careers, there is no clear blueprint for managing generalists.

Hybrid roles are very valuable, but that value is hard to define. They don’t fit in the data science track and they don’t fit in the engineering track. A new track called Data Insights is being discussed, which would fall somewhere in between. These individuals tend to have a lot of product insight and can prototype really quickly. It’s not really machine learning, it’s not really infrastructure; it’s more about the value you can create through breadth and flexibility.

The future of data engineering is intertwined with the future of all engineering. This is because many of the greatest opportunities for the data engineering field in the near future will be in areas where data engineering overlaps with other fields, particularly software engineering.

As long as data engineering is viewed as a niche specialty, the knowledge gap will remain. Yet data is relevant to every aspect of programming, and so every facet of software engineering could benefit from more cross-fertilization with data engineering. As companies become increasingly sophisticated in their use of data, data engineering will become a greater priority and more people will enter the practice from adjacent fields. With them will come new and important perspectives.

Microsoft Open-Sources ONNX Acceleration for BERT AI Model

aiuniverse — Wed, 29 Jan 2020 09:42:00 +0000

Source: infoq.com

Microsoft’s Azure Machine Learning team recently open-sourced their contribution to the ONNX Runtime library for improving the performance of the natural language processing (NLP) model BERT. With the optimizations, the model’s inference on the SQuAD benchmark became 17x faster.

Senior program manager Emma Ning gave an overview of the results in a blog post. In collaboration with engineers from Bing, the Azure researchers developed a condensed BERT model for understanding web-search queries. To improve the model’s response time, the team re-implemented the model in C++. Microsoft is now open-sourcing those optimizations by contributing them to ONNX Runtime, an open-source library for accelerating neural-network inference operations. According to Ning,

With ONNX Runtime, AI developers can now easily productionize large transformer models with high performance across both CPU and GPU hardware, using the same technology Microsoft uses to serve their customers.

BERT is an NLP model developed by Google AI, and Google announced last year that the model was being used by their search engine to help process about 1-in-10 search queries. BERT is useful for handling longer queries, or queries where short words (like “for” and “to”, which are often ignored in standard search engines) are especially relevant to the meaning of the query. Bing also began using deep-learning NLP models in their search engine last year. However, BERT is a complex model, which means that processing a search query through it (aka inference) is computationally expensive and relatively slow. Bing found that even a condensed three-layer version required twenty CPU cores to achieve 77ms latency, which is already close to the limit for users to notice a delay. To handle the volume of queries at Bing’s scale, using this model would require thousands of servers.

BERT inference does benefit from the parallelism of GPUs, and Bing found that the inference latency on Azure GPU VMs dropped to 20ms. For further improvements, the team partnered with NVIDIA to re-implement the model using TensorRT C++ APIs and CUDA libraries. This optimized model achieved 9ms latency. By using mixed precision and Tensor Cores, the latency improved to 6ms.

Deep-learning inference performance is a major concern, for web searches as well as mobile and edge devices, but re-implementing models by hand is not an attractive solution for most practitioners seeking to improve performance. Inference acceleration tools, such as TensorFlow Lite and PyTorch Mobile, are now standard components of deep-learning frameworks. These tools improve performance by automatically re-writing the model code to take advantage of device-specific hardware acceleration and optimized libraries. This process is very similar to that used by an optimizing compiler for a high-level programming language, and similarly requires an abstract representation of the model being optimized. ONNX is an open standard for such a representation, and ONNX Runtime is an implementation of the standard.
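
To give a feel for what "automatically re-writing the model" means, here is a toy sketch of one rewrite pass over an abstract model representation. It is not ONNX Runtime code: the `Node` class, the flat list standing in for a graph, and the `FusedMatMulAdd` operator name are all invented for illustration, and the pass assumes a simple linear chain of operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str    # operator type, e.g. "MatMul"
    name: str  # label for this node in the toy graph

def fuse_matmul_add(nodes):
    """Replace each MatMul immediately followed by Add with one fused node."""
    fused = []
    i = 0
    while i < len(nodes):
        current = nodes[i]
        following = nodes[i + 1] if i + 1 < len(nodes) else None
        if current.op == "MatMul" and following is not None and following.op == "Add":
            # One fused kernel can avoid a round trip through memory.
            fused.append(Node("FusedMatMulAdd", f"{current.name}+{following.name}"))
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused

graph = [
    Node("MatMul", "mm1"), Node("Add", "bias1"), Node("Gelu", "act1"),
    Node("MatMul", "mm2"), Node("Add", "bias2"),
]
optimized = fuse_matmul_add(graph)
print([node.op for node in optimized])
```

Because the pass works on the representation rather than on hand-written model code, the same rewrite can be applied to any model expressed in that representation.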

Taking the lessons learned from re-implementing BERT, the Bing and Azure devs updated the ONNX Runtime code to automatically optimize any BERT model for inference on CPU as well as GPU. When used on the three-layer BERT model, CPU performance improved 17x and GPU performance improved 3x. Bing developers also found the ONNX Runtime was easier to use and reduced their time to optimize new models.

BERT model optimizations are available in the latest release of ONNX Runtime on GitHub.

Data Science Added as Academic Major & Minor

aiuniverse — Fri, 18 Oct 2019 07:30:05 +0000

Source: rose-hulman.edu

Rose-Hulman continues to adapt its curriculum to meet the challenges of today’s fast-paced world.

The institute has added data science as a secondary academic major. This multidisciplinary field combines science, mathematical algorithms and computer science principles to extract knowledge and insights from data. The new course of study is offered by departments covering computer science, software engineering and mathematics, but it is available to students in all major courses of study offered by the institute.

Students also can earn a multidisciplinary minor in data science.

Program Coordinator Sriram Mohan, associate professor of computer science and software engineering, says the new major will provide students with in-depth hands-on experience in data engineering, data analysis, machine learning and artificial intelligence. He points out that analysis of other college data science programs shows that Rose-Hulman will better prepare students for careers in these areas by requiring more fundamental courses, more advanced electives and a capstone experience that’s focused on data science.

“Data science is revolutionizing organizations, and graduates with data science and analytic skills will be critical to the future of the world’s economy,” Mohan says.

As early as next year, computer scientists predict that 1.7 megabytes of data will be generated every second for every human on Earth. The total amount of data is expected to be around 44 trillion gigabytes.

Data science and machine learning are two of the fastest growing areas of the technology sector, with job demand currently far outweighing prospective employees with skills in the field, according to LinkedIn’s 2017 U.S. Emerging Jobs Report.

Google’s chief economist Hal Varian has stated, “The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades.”


Digital disruption shapes big data infrastructure, data engineering

aiuniverse — Fri, 15 Sep 2017 06:42:13 +0000

Source – techtarget.com

Data professionals have always needed a good deal of flexibility. They need even more of it today in industries where big data infrastructure is causing equally big changes in business practices.

This is sometimes called digital disruption, and it has an effect on how data engineering is evolving. That is the case whether the data pro is working for a disruptor, or for a company that could be disrupted.

To get the view of the disruptor, just ask Tarush Aggarwal, director of data engineering at WeWork Companies Inc. This startup builds out and rents interim or temporary workspaces to other startups. This business model could threaten conventional commercial real estate practices. Think of it as a kind of Uber — rather than sharing a ride in someone’s car, you share office space co-working with others as needed, and on a somewhat open-ended basis.

Nimble big data infrastructure

If much of WeWork’s efforts are about nimbly building out infrastructure, so are Aggarwal’s. But his efforts center on data infrastructure and what is needed today to inform those working to grow the company.

“Our focus is on what the business is doing. A data science team can live six months in the future, but a data engineering team has to live right now,” he said. In the age of web-borne big data, living in the now means handling a lot of quickly arriving data.

In terms of data ingestion, WeWork places more emphasis on extract, load and transform (ELT) than on extract, transform and load (ETL) processes, according to Aggarwal.

“The advantage that ELT gives you is it allows you to separate your ingestion from your transformation. That allows you to automate it completely,” he said.

Also, Aggarwal added, separation of ingestion from transformation means WeWork can apply different data transformations later on, should someone get a better idea of what to do with the data.
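The separation Aggarwal describes can be illustrated with a minimal Python sketch. It is not WeWork's pipeline: the table name, event fields and function names are all hypothetical, and an in-memory SQLite database stands in for whatever store is actually used. Ingestion loads raw records untouched, and each transformation is a separate step that can be added or replaced later without re-ingesting anything.

```python
import json
import sqlite3


def ingest(conn, raw_events):
    """Extract + load: store every event verbatim, with no transformation."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(event),) for event in raw_events],
    )
    conn.commit()


def bookings_per_city(conn):
    """Transform (first idea): count booking events per city."""
    counts = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        counts[event["city"]] = counts.get(event["city"], 0) + 1
    return counts


def desks_per_city(conn):
    """Transform (later idea): total desks booked per city, same raw data."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        totals[event["city"]] = totals.get(event["city"], 0) + event["desks"]
    return totals


# Hypothetical sample events; the field names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
ingest(conn, [
    {"city": "Boston", "desks": 4},
    {"city": "Boston", "desks": 2},
    {"city": "New York", "desks": 10},
])
print(bookings_per_city(conn))
print(desks_per_city(conn))
```

Because the raw payloads are kept as loaded, the second transformation needed no change to the ingestion step.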

Aggarwal shares a disruptor’s view. He advises that data engineers spend time looking at how data in the organization is being used, work toward optimizing access to that data, and then add features on an ongoing basis.

Data reliability is important, he emphasized, “but not at the cost of flexibility.”

Cord cutters call the tune

To see today’s data engineering from another point of view, you could turn to Jeffrey Pinard. As vice president for data technology and engineering for advanced advertising initiatives at NBC, he is at the center of the 91-year-old peacock network’s efforts to respond to disruption in the television advertising business. Like Aggarwal, he spoke as part of this month’s Big Data Innovation Summit in Boston.

NBC is acutely aware that today’s audience is moving, in some part, from traditional television viewing to internet cord cutting. Useful data is available on that internet audience, and it is changing the way advertising decisions are made. Things are different than they were in the days when Nielsen broadcast ratings were king.

“We need to change the way NBC approaches advertising,” said Pinard.

In pursuing that objective, NBC set out to build a portfolio of audience analytics products called Audience Studio. There is plenty of data engineering involved.

“To support this, we needed to build a foundation from scratch — an infrastructure that was going to support our needs for the future,” he said.

That meant changes, as NBC was traditionally, in Pinard’s words, “an on-premises organization.” The infrastructure build-out needed to be cost-effective and to support the technology changes over time, he said, and cloud came under consideration.

Pinard and his colleagues came up with a fairly unusual approach: a cloud data lake. While it is somewhat in the spirit of Hadoop distributed processing, it actually forgoes Hadoop. Pinard described the use of Amazon Web Services’ Simple Storage Service (S3), Apache Spark, Apache Parquet, Apache Mesos and containers in building an on-cloud data lake that takes ingested data, allows for elastic processing and supports data access according to end users’ permissions and job needs.

Moreover, the ability to store vast amounts of data enables end users to trace data’s lineage, which is a useful trait in meetings that too often revolve around finding out how somebody arrived at a certain data point.
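As a rough illustration of what tracing lineage means, the Python sketch below records which datasets each dataset was derived from and walks that chain backward. It is not NBC's system: the Dataset class, the trace_lineage function and the dataset names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A named dataset plus the datasets it was derived from."""
    name: str
    parents: list = field(default_factory=list)


def trace_lineage(dataset):
    """Return the names of all upstream datasets, nearest first, without repeats."""
    seen = []
    queue = list(dataset.parents)
    while queue:
        current = queue.pop(0)
        if current.name not in seen:
            seen.append(current.name)
            queue.extend(current.parents)
    return seen


# Hypothetical datasets: two raw feeds, a joined table, and a final report.
raw_views = Dataset("raw_views")
raw_ads = Dataset("raw_ads")
joined = Dataset("views_with_ads", parents=[raw_views, raw_ads])
report = Dataset("audience_report", parents=[joined])

print(trace_lineage(report))
```

Asking how a figure in the report was produced then becomes a lookup over stored metadata.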

Data is central to transformation

There are threads connecting the disruptors’ schemes with those of the companies that could be disrupted. With any effective big data infrastructure, the ultimate point is to understand how people use it.
