Synapse Archives - Artificial Intelligence

A look at Microsoft Azure Synapse Analytics

aiuniverse — Thu, 16 Apr 2020 06:29:03 +0000

Source: mashviral.com

About six months after its announcement at the Ignite conference last fall, Microsoft took a group of analysts and MVP professionals on a deeper dive into the Azure Synapse Analytics service. As noted in Andrew’s coverage last fall, Azure Synapse Analytics is a rebrand and evolution of Azure SQL Data Warehouse, expanding its footprint to span data warehousing, data lakes, and data integration within a single cloud service.

The guiding notion is to get as close as possible to a single source of truth, which in this case involves converging the data warehouse, data lake, and data integration. This is a much tougher challenge than it seems: it means not only bringing together highly curated relational data with a wide range of variable and semi-structured data, but also bringing together different groups of professionals with skill sets, methods, and compute demands that are often diametrically opposed.

At one end you have database developers who specialize in working with SQL, while at the other end of the spectrum, data scientists and developers who work on the data lake typically work with programmatic analytics in languages like Python. Data warehouses, like any relational system, have typically been used for production and operational scenarios that require reliable performance and often the ability to serve a large population of users, while data lakes are more closely associated with experimentation on highly diverse data sets and with less predictable workloads serving a handful of end users.

The result is that you have different workload characteristics, different types of data, and different access patterns. This is the same reason data warehouses were spawned years ago, as query and reporting workloads interfered with operational systems. But with Azure Synapse, Microsoft is looking on the analytics side to bring the poles together. Although Azure Synapse is a generally available service today, the expanded platform is only six months out of the gate. So while Azure Synapse has the capabilities to support business analysts and data scientists, there are still more pieces that need to fall into place.

Let’s start by keeping the lights on. Workload management has been a well-known issue for data warehousing for years: demand patterns for ad hoc queries, end-of-period reporting, and complex analytics are well understood, and for years turnkey data warehouse providers such as Teradata offered a family of models optimized, respectively, for data-intensive, compute-intensive, high- or low-concurrency, and “balanced” workloads to maximize the output of compute resources.

When Hadoop arrived, it was assumed that most workloads would be data-intensive, and so compute was moved to the data. Enter cloud-native, and the pendulum swung back to separating compute from data for economic reasons (analytic workloads tend to be spiky, so why pay for compute you are not always using?), with the high bandwidth of modern cloud backplanes addressing the data movement problem. Then came AI, which, depending on whether it is machine learning or deep learning, has diverging resource requirements.

Therefore, joining the data warehouse with the data lake is no mean feat. Azure Synapse has addressed the workload issue with a cloud-native architecture that relies on the separation of compute from storage in SQL Data Warehouse Gen 2 and extends this concept to heterogeneous SQL and Spark compute within a single service. At the moment, they are using Azure Data Lake Storage (ADLS) Gen 2, which is designed to provide the economies of cloud object storage with the performance benefits of exposing data through a POSIX-compliant file system API. Azure Synapse Analytics also offers a multi-level hierarchical cache inside the SQL engine that automatically moves data between performance tiers (including disk storage and NVMe SSD caching) based on the user workload, while Spark analytics runs on high-memory (8-GByte/node) instances.

Functionally, Azure Synapse Analytics starts by combining Azure Data Factory with Azure SQL Data Warehouse: the former remains available as a standalone service, while Azure Synapse supersedes the latter. And although it does not bundle Power BI or Azure Machine Learning directly into the same service, integrations are built in at the metadata and user interface levels, so the flow is natural.

Azure Synapse uses the concept of a workspace to organize data and code or query artifacts. The workspace can surface as a low-code/no-code tool for business analysts, or as a Jupyter-like notebook for data engineers and data scientists working in Spark or applying machine learning models. In the demos, Microsoft showed how the same data transformation task could be performed in both ways. There will be some differences in experience: for example, while Synapse inherits the ability of Azure SQL Data Warehouse to support high concurrency, Spark environments often involve lone-wolf data scientists or data engineers. There are also differences in data security levels: the practice is much more mature on the relational database side, with table, column, and native row-level security, but not as mature on the data lake side. This is an area where Cloudera differentiates itself with SDX, which is available as part of its platform offerings.

Due to the early phase of the Spark feature implementation, Python is currently supported, but R is not yet available. Given the momentum of Python, that is probably not a show-stopper for most data scientists.

As this is a highly optimized platform, it is not surprising that Microsoft has added some customizations to its Spark and Jupyter-like Interact notebook implementations, and that not all Spark libraries are currently supported. Without delving into the weeds, Microsoft is looking to a more complete implementation of Spark in Azure Synapse after Spark 3.0 is released. However, for data scientists and engineers who want the pure Spark experience, Azure Databricks will still be the better choice.

So, what is on our wish list?

At the moment, Azure Synapse Analytics operates on the idea of a single data lake composed of relational tables, folders, and files of different formats. In the future, we would like to see it reach out to more data platforms in the Azure portfolio, as we consider the data lake to be the collection of data, wherever it is located, in the enterprise. To that end, for Spark practitioners, we’d like to see first-party integration with Azure Databricks. There is room to extend supported compute instances, especially for AI workloads that require GPUs or ASICs. We would also like to see a hybrid strategy, where Microsoft already has a foot in the door with Azure Stack and Azure Arc. And we would also like to see an Azure Synapse partner program that provides close integration and support for third-party tools that can plug into the workspaces.

Oh, and one more thing. Today, Power BI and Azure Machine Learning are treated as ancillary services: as mentioned before, they are integrated into Synapse, but not bundled into the service. In the long term, we believe that both services should be packaged as integral parts of Azure Synapse. Today we believe that virtually every customer using Synapse will also use self-service visualization, whether with Power BI or third-party tools like Tableau. Machine learning, on the other hand, is not quite at that point today, but we expect this to change quickly over the next two years or less, as internally developed or prebuilt third-party models become ubiquitous. This is the writing on the wall.

This is not Microsoft’s first stab at combining the data warehouse and the data lake. For on-premises, there was SQL Server 2019 Big Data Clusters, which placed a SQL Server engine on each node in a Hadoop cluster, making the data lake (as defined initially by clusters with data stored in HDFS) accessible to SQL queries. But Azure Synapse is a complete rethink. More than just making big data available to SQL and Python developers, it also changes the development environment by creating “workspaces”. It addresses a broader part of the analytics lifecycle, from data ingestion, transformation, and integration to self-service visualization and even collaboration, by embedding Power BI reports in Microsoft Teams.

But more to the point, Azure Synapse reflects the fact that in the cloud, vendors can break down toolchain silos to present more unified offerings that cover more of the lifecycle. Microsoft is hardly the only vendor heading this way. SAP Data Warehouse Cloud is taking a similar approach by integrating SAP Analytics Cloud to provide last-mile self-service visualization, while Oracle has begun publicly speaking about extending the Autonomous Data Warehouse into a wider platform that, like Azure Synapse, would include more of the life cycle (we expect Oracle Analytics Cloud integration to become a major component). So now we are waiting for the next shoes to drop from AWS and Google.

The post A look at Microsoft Azure Synapse Analytics appeared first on Artificial Intelligence.

A closer look at Microsoft Azure Synapse Analytics

aiuniverse — Wed, 15 Apr 2020 11:34:29 +0000

Source: zdnet.com

Roughly six months after its unveiling at the Ignite conference last fall, Microsoft took a group of analysts and MVP professionals on a deep dive into the Azure Synapse Analytics service. As noted in Andrew’s coverage last fall, Azure Synapse Analytics is a rebrand and evolution of Azure SQL Data Warehouse, broadening its footprint to span data warehousing, data lakes, and data integration within a single cloud service.

The guiding notion is getting as close as possible to a single source of truth, which in this case amounts to converging the data warehouse, data lake, and data integration. That challenge is a lot harder than it sounds, as it not only brings together highly curated relational data with a broader array of variable and semi-structured data, but it also means bringing together different groups of practitioners with skillsets, methods, and compute demands that are often diametrically opposed.

At one end, you have database developers skilled in working with SQL, whereas at the other end of the spectrum, data scientists and developers working off the data lake have typically worked with programmatic analytics in languages such as Python. Data warehouses, like any relational system, have typically been used for production and operational scenarios demanding reliable performance, and frequently, the ability to serve a large population of users, while data lakes are more associated with experimentation with highly varied data sets and less predictable workloads serving a handful of end users.

So, the result is you have different workload characteristics, different data types, and different access patterns. That’s the same rationale that spawned data warehouses years ago, as query and reporting workloads interfered with operational systems. But with Azure Synapse, Microsoft is seeking on the analytics side to bring the poles together. Although Azure Synapse is a generally available service today, the expanded platform is barely six months out of the gate. So, while Azure Synapse has the capabilities to support business analysts and data scientists, there are still more pieces to fall into place.

Let’s start with keeping the lights on. Workload management has been a well-known issue for data warehousing for years – demand patterns for ad hoc inquiry, end-of-period reporting, and complex analytics are well-known, and for years, turnkey data warehouse system providers like Teradata offered a family of models optimized, respectively, for data-intensive, compute-intensive, high- or low-concurrency, and “balanced” workloads to maximize the output of compute resources.

When Hadoop came along, it was assumed that the brunt of workloads would be data-intensive, and so compute was moved to the data. Enter cloud-native, and the pendulum swung back to separating compute from data for economic reasons (analytic workloads often tend to be spiky, so why pay for compute you’re not always using?), with the high bandwidth of modern cloud backplanes addressing the data movement problem. Then came AI, which, depending on whether it’s machine learning or deep learning, has diverging resource requirements.
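The economic argument for separating compute from storage can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly rate, the hours in a month, and the number of busy hours are made-up assumptions, not Azure prices, and storage cost is left out because it is the same in both cases.

```python
# Illustrative figures only; none of these are real Azure prices.
HOURS_PER_MONTH = 730      # assumed length of a billing month
RATE_PER_HOUR = 2.0        # hypothetical cost of the compute cluster per hour
BUSY_HOURS = 60            # hypothetical hours of actual analytic work


def coupled_cost(hours_in_period: float, rate: float) -> float:
    """Compute is tied to the storage nodes, so it runs (and bills) all month."""
    return hours_in_period * rate


def decoupled_cost(busy_hours: float, rate: float) -> float:
    """Compute is paused when idle, so only the busy hours are billed."""
    return busy_hours * rate


always_on = coupled_cost(HOURS_PER_MONTH, RATE_PER_HOUR)
on_demand = decoupled_cost(BUSY_HOURS, RATE_PER_HOUR)
savings = 1 - on_demand / always_on

print(f"always-on: {always_on:.2f}  on-demand: {on_demand:.2f}  saved: {savings:.0%}")
```

The spikier the workload (the smaller the share of busy hours), the larger the gap, which is the reasoning behind the pendulum swing described above.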

So, bringing the data warehouse together with the data lake is no mean feat. Azure Synapse has attacked the workload issue with a cloud-native architecture that builds on the separation of compute from storage in SQL Data Warehouse Gen 2 and extends this concept to heterogeneous SQL and Spark compute within a single service. For now, they are using Azure Data Lake Storage (ADLS) Gen 2, which is designed to deliver the economies of cloud object storage with the performance advantages of exposing the data through a file system API that is POSIX-compliant. Azure Synapse Analytics also offers a multi-level hierarchical cache within the SQL engine that automatically moves data between performance tiers (which includes disk storage and NVMe SSD caching) depending on the user workload, while Spark analytics runs on high-memory (8-GByte/node) instances.
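To illustrate the general idea of a cache that moves data between performance tiers according to the workload, here is a minimal sketch. It is not Microsoft’s implementation: the `TieredCache` class, its two tiers, and the least-recently-used promotion and demotion policy are all assumptions chosen for simplicity.

```python
from collections import OrderedDict


class TieredCache:
    """Toy two-tier store: a small fast tier in front of a large slow tier.

    Hypothetical illustration; the real Synapse cache policy is not public here.
    """

    def __init__(self, fast_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # hot data, kept in least-recently-used order
        self.slow = {}              # everything else (stands in for disk storage)

    def put(self, key: str, value: bytes) -> None:
        # New data lands in the slow tier; it is promoted only when it is read.
        self.fast.pop(key, None)
        self.slow[key] = value

    def get(self, key: str) -> bytes:
        if key in self.fast:
            self.fast.move_to_end(key)        # mark as most recently used
            return self.fast[key]
        value = self.slow.pop(key)            # raises KeyError for unknown keys
        self.fast[key] = value                # promote on access
        if len(self.fast) > self.fast_capacity:
            cold_key, cold_value = self.fast.popitem(last=False)
            self.slow[cold_key] = cold_value  # demote the coldest entry
        return value


cache = TieredCache(fast_capacity=2)
for name in ("a", "b", "c"):
    cache.put(name, name.encode())
for name in ("a", "b", "c"):
    cache.get(name)
print(list(cache.fast), list(cache.slow))
```

Data that queries keep touching stays in the fast tier, while data that falls out of use drifts back to the slow tier without anyone managing placement by hand.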

Functionally, Azure Synapse Analytics starts by combining Azure Data Factory with Azure SQL Data Warehouse – the former is still available as a standalone service, while Azure Synapse supersedes the latter. And while it does not bundle Power BI or Azure Machine Learning into the same service directly, integrations are built-in at the metadata and user interface levels, so the flow is natural.

Azure Synapse uses the concept of a workspace to organize data and code or query artifacts. And the workspace can surface as a low-code/no-code tool for business analysts or a Jupyter-like notebook for data engineers and data scientists to work in Spark or apply machine learning models. In the demos, Microsoft showed how the same data transformation task could be developed using both paths. There will be some differences in the experience – for instance, while Synapse inherits the Azure SQL Data Warehouse capability to support high concurrency, Spark environments have typically involved lone-wolf data scientists or data engineers. There are also differences in levels of data security – practice is far more mature on the relational database side with table, column, and native row-level security, but not as mature on the data lake side. That’s an area where Cloudera differentiates with SDX, which is available as part of its platform offerings.

Owing to the early stage of the Spark feature implementation, Python is currently supported, but R is not there yet. Given Python’s momentum, that’s probably not a show-stopper for most data scientists.

As this is a highly optimized platform, it’s not surprising that Microsoft has added some customizations to its Spark and Jupyter-like Interact notebook implementations, and that not all Spark libraries are currently supported. Without diving into the weeds, Microsoft is looking to a more complete Spark implementation in Azure Synapse once Spark 3.0 comes out. Nonetheless, for data scientists and engineers who want the pure Spark experience, Azure Databricks will remain the better choice.

So, what’s on our wish list?

For now, Azure Synapse Analytics operates on the notion of a single data lake composed of relational tables, folders, and files of varying formats. In the future, we would like it to reach out to more data platforms in the Azure portfolio, as we view the data lake as being the collection of data, wherever it sits, in the enterprise. Towards that end, for Spark practitioners, we’d like to see first-party integration with Azure Databricks. There’s room to expand supported compute instances, especially for AI workloads requiring GPUs or ASICs. We would also like to see a strategy for hybrid, where Microsoft already has a foot in the door with Azure Stack and Azure Arc. And we’d also like to see an Azure Synapse partner program that would provide tight integration and support for third-party tools that could plug into the workspaces.

Oh, and one other thing. Today, Power BI and Azure Machine Learning are treated as ancillary services – as mentioned above, they are integrated into Synapse, but they are not bundled into the service. In the longer run, we believe that both services should be packaged as integral parts of Azure Synapse. Today, we believe that virtually all customers who use Synapse will also be using self-service visualization, whether it’s with Power BI or third-party tools like Tableau. On the other hand, today, that’s not quite the case with machine learning, but we expect that to change pretty rapidly within the next couple of years or less as internally developed or prebuilt third-party models become ubiquitous. That’s the handwriting on the wall.

This is not Microsoft’s first stab at bridging the data warehouse and data lake. For on-premises, there was SQL Server 2019 Big Data Clusters, which placed a SQL Server engine on each node of a Hadoop cluster, making the data lake (as originally defined by clusters with data stored in HDFS) accessible to SQL queries. But Azure Synapse is a complete rethink. More than just making big data available to SQL and Python developers alike, it also changes the development environment by creating “workspaces.” It addresses a broader chunk of the analytics lifecycle, from data ingestion, transformation, and integration all the way through self-service visualization and even collaboration by embedding Power BI reports into Microsoft Teams.

But more to the point, Azure Synapse reflects the fact that in the cloud, providers can break down the silos in the toolchain to present more unified offerings covering more of the lifecycle. Microsoft is hardly the only provider heading down this path. SAP Data Warehouse Cloud is taking a similar approach by integrating SAP Analytics Cloud to provide the self-service visualization last mile, while Oracle has begun publicly talking about extending the Autonomous Data Warehouse into a broader platform offering that, like Azure Synapse, would encompass more of the life cycle (we expect Oracle Analytics Cloud integration to become a core component). So now we’re waiting for the next shoes to drop from AWS and Google.

The post A closer look at Microsoft Azure Synapse Analytics appeared first on Artificial Intelligence.