A look at Microsoft Azure Synapse Analytics

Source: mashviral.com

About six months after unveiling it at the Ignite conference last fall, Microsoft took a group of analysts and MVP practitioners on a deeper dive into the Azure Synapse Analytics service. As noted in Andrew’s coverage last fall, Azure Synapse Analytics is a rebrand and evolution of Azure SQL Data Warehouse, expanding its footprint to span data warehousing, data lakes, and data integration from within a single cloud service.

The guiding notion is to get as close as possible to a single source of truth, which in this case involves converging data warehouse, data lake, and data integration. This is a much tougher challenge than it seems: it means not only bringing together highly curated relational data with a wide range of variably structured and semistructured data, but also bringing together different groups of professionals whose skills, methods, and demands are often diametrically opposed.

At one end you have database developers who specialize in working with SQL, while at the other end of the spectrum, data scientists and developers who work on the data lake often work with programmatic analytics in languages like Python. Data warehouses, like any relational system, have typically been used for production and operational scenarios that require reliable performance and often the ability to serve a large population of users, while data lakes are more closely associated with experimentation on highly diverse datasets and less predictable workloads serving a more modest number of end users.

The result is that you have different workload characteristics, different types of data, and different access patterns. This is the same reason data warehouses sprang up years ago, as query and reporting workloads interfered with operational systems. But with Azure Synapse, Microsoft is looking to the analytics side to bring the strands together. Although Azure Synapse is a generally available service today, the expanded platform is only six months out of the gate. So while Azure Synapse has the capabilities to support business analysts and data scientists, there are still pieces that need to be filled in.

Let’s start with keeping the lights on. Workload management has been a well-known issue in data warehousing for years: the demand patterns for ad hoc queries, period-end reports, and complex analytics are well understood. Years ago, turnkey data warehouse providers such as Teradata began offering families of models optimized, respectively, for “data intensive,” “high compute,” “high or low concurrency,” and “balanced” workloads to maximize the yield from compute resources.

When Hadoop arrived, the assumption was that workloads would be data intensive, and so compute was moved to the data. Enter cloud-native architecture, and the pendulum swung back toward separating compute from data for economic reasons (analytical workloads are intermittent, so why pay for compute you don’t always use), with the bandwidth of modern cloud data centers addressing the data movement problem. Then came AI, which, depending on whether it is machine learning or deep learning, requires divergent resources.

Joining the data warehouse with the data lake is therefore no mean feat. Azure Synapse has addressed the workload issue with a cloud-native architecture that builds on the separation of compute from storage in SQL Data Warehouse Gen 2 and extends this concept to heterogeneous SQL and Spark compute within a single service. At the moment, it uses Azure Data Lake Storage (ADLS) Gen 2, which is designed to combine the economics of cloud object storage with the performance benefits of exposing data through a POSIX-compliant file system API. Azure Synapse Analytics also offers a multi-level hierarchical cache inside the SQL engine that automatically moves data between performance tiers (including disk storage and the NVMe SSD cache) based on the workload, while Spark analytics run on high-memory (8 GByte / node) instances.

Functionally, Azure Synapse Analytics starts by combining Azure Data Factory with Azure SQL Data Warehouse: the former remains available as a standalone service, while Azure Synapse supplants the latter. And although it does not bundle Power BI or Azure Machine Learning directly into the same service, integrations are built in at the metadata and user interface levels, so the flow is natural.

Azure Synapse uses the concept of a workspace to organize data and code or query artifacts. The workspace can surface as a low-code / no-code tool for business analysts, or as a Jupyter-like notebook for data engineers and data scientists working in Spark or applying machine learning models. In the demos, Microsoft showed how the same data transformation task could be performed both ways. There will be some differences in experience. For example, while Synapse inherits the ability of Azure SQL Data Warehouse to support high concurrency, Spark environments often involve lone wolf data scientists or data engineers. There are also differences in data security levels: the practice is much more mature for relational databases, with native table-, column-, and row-level security, but not so mature on the data lake side. This is an area where Cloudera differentiates itself with SDX, which is available as part of its platform offering.
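To make the idea of performing the same transformation both ways concrete, here is a minimal sketch. It is not Synapse code: Python’s built-in sqlite3 module stands in for the SQL engine, plain Python stands in for a Spark notebook, and the sales table, its columns, the sample rows, and both function names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) rows invented for this sketch.
ROWS = [("east", 100.0), ("west", 250.0), ("east", 50.0), ("west", 25.0)]


def totals_with_sql(rows):
    """Declarative route: load the rows into a table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a SQL engine
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


def totals_with_code(rows):
    """Programmatic route: the same aggregation written as ordinary Python."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    print(totals_with_sql(ROWS))
    print(totals_with_code(ROWS))
```

Both routes produce the same per-region totals; what differs is which group of practitioners is comfortable writing each one.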

Given the early stage of the Spark implementation, Python is currently supported, but R is not yet available. Given the momentum of Python, that is probably not a showstopper for most data scientists.

As this is a highly optimized platform, it is not surprising that Microsoft has added some customizations to its Spark and Jupyter-like notebook implementations, and that not all Spark libraries are currently supported. Without delving into the weeds, Microsoft is aiming for a more complete implementation of Spark in Azure Synapse after Spark 3.0 is released. However, for data scientists and engineers who want the pure Spark experience, Azure Databricks will still be the best choice.

What is on our wish list?

At the moment, Azure Synapse Analytics operates on the idea of a single data lake composed of relational tables, folders, and files of different formats. In the future, we would like to see more data platforms from Azure’s portfolio come within reach, as we consider the data lake to be the collection of data wherever it resides in the enterprise. To that end, for Spark practitioners, we’d like to see first-party integration with Azure databases. There is room to extend the supported compute instances, especially for AI workloads that require GPUs or ASICs. We would also like to see a hybrid strategy, where Microsoft already has a foot in the door with Azure Stack and Azure Arc. And we would like to see an Azure Synapse partner program that provides close integration and support for third-party tools that can connect to workspaces.

Oh, and one more thing. Today, Power BI and Azure Machine Learning are treated as ancillary services: as mentioned before, they are integrated with Synapse but not included in the service. In the long term, we believe that both services need to be packaged as integral parts of Azure Synapse. We believe that virtually every customer using Synapse today will also use self-service visualization, whether with Power BI or third-party tools like Tableau. With machine learning, that is not quite the case today, but we expect this to change quickly over the next two years or less as models developed by third parties or built in-house become ubiquitous. This is the writing on the wall.

This is not Microsoft’s first attempt at combining the data warehouse and the data lake. For the on-premises crowd, there was SQL Server 2019 Big Data Clusters, which placed a SQL Server engine on each node in a Hadoop cluster, making the data lake (as originally defined by clusters with data stored in HDFS) accessible to SQL queries. But Azure Synapse is a complete rethink. Rather than just making big data available to SQL and Python developers, it also changes the development environment by creating “workspaces”. It addresses a broader part of the analytics lifecycle, from data ingestion, transformation, and integration to self-service visualization and on to collaboration, with Power BI reports embedded in Microsoft Teams.

But more to the point, Azure Synapse reflects the fact that in the cloud, vendors can break down toolchain silos into more unified offerings that cover more of the lifecycle. Microsoft is hardly the only vendor heading this way. SAP Data Warehouse Cloud is taking a similar approach by integrating SAP Analytics Cloud to provide last-mile self-service visualization, while Oracle has begun speaking publicly about extending Autonomous Data Warehouse into a wider platform that, like Azure Synapse, would cover more of the lifecycle (we hope Oracle Analytics Cloud integration will become a major component). So now we are waiting for the next shoes to drop from AWS and Google.