Databricks Archives - Artificial Intelligence

Databricks Unveils New Machine Learning Solution

aiuniverse — Fri, 11 Jun 2021 04:49:21 +0000

Source – https://www.datanami.com/

Databricks today unveiled a new cloud-based machine learning offering that’s designed to give engineers everything they need to build, train, deploy, and manage ML models.

The new offering is designed to bridge the gap in existing machine learning products that arises by focusing too much on data engineering, ML model creation, or the deployment aspects of the machine learning cycle, Databricks says.

“Many ML platforms fall short because they ignore a key challenge in machine learning: they assume that data are available at high quality and ready for training,” Databricks says in its announcement. “That requires data teams to stitch together solutions that are good at data but not AI, with others that are good at AI but not data.”

To address this gap, Databricks lets users switch between user “experiences” that it exposes, including data science/engineering, SQL analytics, and machine learning experiences, to access tools and features relevant to their everyday workflow.

The new Databricks offering includes two major components: AutoML and Feature Store.

Databricks AutoML, like other AutoML solutions, automates many of the steps that data scientists must manually go through in terms of experimenting and testing different machine learning models. But instead of working like a “black box,” like other AutoML offerings, the new Databricks offering works like a “glass box,” according to Databricks vice president of marketing Joel Minnick.

AutoML will give data scientists the ability to peer into the inner workings of the model to see how things are working, either through the Databricks UI, through an API call, or through the notebook interface, he says.

“We will do everything that you expect an AutoML tool to do, like analyzing the data, figuring out the features, training and tuning the model,” he says. “But what we give you on the end of that process is all the experiments we ran, all of the notebooks we auto-generated as a result of those [runs] and let you compare those models and decide, perhaps I’d be willing to give up a point of accuracy if I can get inference 200 milliseconds faster with this model versus perhaps the most accurate model.”
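The trade-off Minnick describes, giving up a point of accuracy for faster inference, can be sketched as a simple selection rule over trial results. This is only an illustration and not the Databricks AutoML API: the `Trial` record, the `pick_model` helper and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One candidate model from a hypothetical AutoML run."""
    name: str
    accuracy_pct: float   # accuracy in percentage points
    latency_ms: float     # mean inference latency in milliseconds


def pick_model(trials, max_accuracy_loss_pts=1.0):
    """Return the fastest trial within `max_accuracy_loss_pts` of the best accuracy."""
    if not trials:
        raise ValueError("no trials to choose from")
    best = max(t.accuracy_pct for t in trials)
    eligible = [t for t in trials if best - t.accuracy_pct <= max_accuracy_loss_pts]
    return min(eligible, key=lambda t: t.latency_ms)


# Made-up results, for illustration only.
trials = [
    Trial("gradient_boosting", accuracy_pct=94.0, latency_ms=260.0),
    Trial("random_forest", accuracy_pct=93.0, latency_ms=60.0),
    Trial("logistic_regression", accuracy_pct=88.0, latency_ms=5.0),
]
chosen = pick_model(trials)
print(chosen.name)
```

With a one-point tolerance the rule prefers the model that is 200 milliseconds faster; setting the tolerance to zero falls back to the most accurate model.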

If edge cases arise and the model is acting up, Databricks AutoML lets the data scientist “dive into the notebook code and have full control over that notebook code, so that if I do need to spend some time accounting for edge cases, doing some additional tuning, I can do that,” Minnick says. “I can also understand how this model works, so if I have to explain to regulatory or to compliance authorities exactly how I’m making these decisions, I’m able to do that and be sure that that transparency is there.”

Feature Store, meanwhile, automatically tracks the machine learning features that are core to the functioning of the machine learning models. This helps to ensure the performance of the model doesn’t drift over time. The Feature Store is integrated with Databricks’ Delta Lake platform and uses Delta Lake APIs. It’s also integrated with MLflow, which simplifies ongoing management of the model.

The features are embedded directly into the models when packaged with MLflow, which simplifies the process of changing the features at a later time, Minnick says.
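One way to picture why packaging the feature definitions with the model spares the client application from changes is the sketch below. It is a toy, in-memory illustration under stated assumptions: `FeatureStore`, `PackagedModel`, the feature names and the weights are all hypothetical and are not the Databricks Feature Store or MLflow APIs.

```python
class FeatureStore:
    """Toy in-memory feature table keyed by entity id (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def write(self, key, features):
        self._rows.setdefault(key, {}).update(features)

    def lookup(self, key, names):
        row = self._rows[key]
        return [row[name] for name in names]


class PackagedModel:
    """A linear model bundled with the names of the features it needs."""

    def __init__(self, store, feature_names, weights):
        self._store = store
        self._feature_names = list(feature_names)
        self._weights = list(weights)

    def predict(self, key):
        # The model, not the caller, decides which features to fetch.
        values = self._store.lookup(key, self._feature_names)
        return sum(w * v for w, v in zip(self._weights, values))


store = FeatureStore()
store.write("customer_42", {"orders_30d": 3.0, "avg_basket": 20.0})

model_v1 = PackagedModel(store, ["orders_30d"], [2.0])
model_v2 = PackagedModel(store, ["orders_30d", "avg_basket"], [2.0, 0.5])

# The client call is identical for both versions: only an entity key is passed.
score_v1 = model_v1.predict("customer_42")
score_v2 = model_v2.predict("customer_42")
```

Because the caller passes only a key, moving from `model_v1` to `model_v2` adds a feature without touching the calling code.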

“So I never have to engage the application engineering team to make changes to the client application just because I’m evolving the features,” he tells Datanami. “This is a way for us to help customers get models to production and iterate those models faster and easier than before.”

The post Databricks Unveils New Machine Learning Solution appeared first on Artificial Intelligence.

Databricks To Offer Its Big Data Analytics System On Google Cloud

aiuniverse — Thu, 18 Feb 2021 05:40:35 +0000

Source – https://www.crn.com/

The Databricks Unified Data Platform, using the Google Kubernetes Engine, can be deployed in a containerized cloud environment and link to Google BigQuery and other GCP services.

Databricks’ Unified Data Platform will be available on the Google Cloud Platform beginning in April, the two companies said Wednesday, completing a trifecta for Databricks whose software already runs on the Amazon Web Services and Microsoft Azure cloud platforms.

Databricks’ software also will be integrated with the Google BigQuery business analytics system and leverage the Google Kubernetes Engine – enabling businesses and organizations to deploy Databricks in a containerized cloud environment for the first time.

With Databricks available on Google Cloud, customers can create “data lakehouses” – a combination of data lake and data warehouse systems that Databricks has been promoting – for a range of business analytics, data engineering, data science and machine learning tasks.

The alliance with Google comes at a time when Databricks has been gaining momentum – and a lot of attention – in the big data space. Earlier this month the company raised $1 billion in a Series G round of financing, boosting its post-money valuation to $28 billion in a move many see as a precursor to an IPO sometime this year.

The alliance also comes as more businesses and organizations are choosing to store their data – especially data managed for analytical tasks – in the cloud and run their business analysis and data visualization tools on cloud platforms.

“The cloud is huge now. But it’s really the beginning of the journey for a lot of customers – especially when it comes to data and data warehouses,” said David Meyer, Databricks senior vice president of product management, in an interview with CRN.

“We’re seeing a very significant acceleration of, specifically, those data workloads moving to the cloud,” said Kevin Ichhpurani, Google corporate vice president and head of Google Cloud global ecosystem and business development, also in the interview with CRN.

Ichhpurani said those efforts to move data and business analytics to the cloud are being driven by rapidly growing data volumes and the role that data analytics, data orchestration and machine learning are playing in digital transformation initiatives and the re-invention of business processes.

The Google executive also emphasized that the alliance goes beyond joint development to include go-to-market and sales activities – including working with channel partners.

The two companies said that members of their partner ecosystems have committed to support Databricks on Google Cloud including systems integrators Accenture, Cognizant and Deloitte; strategic service providers such as Slalom and SADA; and software developer partners including Qlik, Tableau, Informatica and Confluent.

Meyer noted that many businesses and organizations are decommissioning older data warehouse and analytics systems and moving them to the cloud. “It’s a great time for these SIs [systems integrators] to lean in and help these customers figure out where they want to be three or four years from now.”

Ichhpurani said several of the systems integrators who partner with Google and Databricks are interested in launching practices around the Databricks-on-Google Cloud offering.

The Databricks Unified Data Platform on Google Cloud is currently in beta testing and is expected to be generally available in April. It will be available through the Google Cloud Marketplace for simpler procurement, user provisioning, sign-on and unified billing.

Databricks on GCP will allow customers to rapidly deploy and scale Databricks’ software on Google Cloud’s global network and easily adjust the usage rate based on their current needs.

The two companies also said the advanced security and data protection controls provided by the GCP will make the combination especially attractive for use within highly regulated industries.

By using the Google Kubernetes Engine as the operating environment for running Databricks on Google Cloud, Databricks can leverage Kubernetes managed services for security, network policy and compute, according to the two companies. It also speeds the release of new features at scale and at lower cost.

The companies have also developed connectors to integrate Databricks with Google BigQuery, Google Cloud Storage, Google’s Looker data exploration and discovery tool, and Google’s Pub/Sub messaging and data ingestion system.

“By combining Databricks’ capabilities in data engineering and analytics with Google Cloud’s global, secure network—and our expertise in analytics and delivering containerized applications—we can help companies transform their businesses through the power of data,” said Thomas Kurian, Google Cloud CEO, in a statement.

The availability of Databricks on Google Cloud “deliver[s] on our shared vision of a simplified, open and unified data platform that supports all analytics and AI use-cases that will empower our customers to innovate even faster,” said Databricks CEO Ali Ghodsi in the same statement. “This is a pivotal milestone that underscores our commitment to enable customer flexibility and choice with a seamless experience across cloud platforms.”

Does Recent Databricks’ Massive Investment Signal A Maturing Data Science Industry?

aiuniverse — Thu, 11 Feb 2021 08:26:11 +0000

Source – https://analyticsindiamag.com/

In February 2021, San Francisco-based Databricks closed a $1 billion late-stage round led by Franklin Templeton. Canada Pension Plan Investment Board, Amazon Web Services (AWS), and Salesforce Ventures also participated in the Series G funding.

The billion-dollar investment came in the wake of Databricks’ partnership with the cloud companies, which CEO Ali Ghodsi called a ‘symbiotic relationship of strategic importance’.

With such firms attracting significant investments, is the data science industry maturing?

Growth Phase

Databricks has seen rapid growth in the past couple of years. In 2019, the company raised $250 million and $400 million in two funding rounds. Currently, the company is valued at $28 billion.

Many other data companies are also seeing huge investments. Snowflake, a cloud data warehousing firm, attracted funding of $479 million in February 2020, taking the total investment tally to $1.4 billion. In September, the company raised $3.36 billion in the biggest software IPO ever.

Alteryx, a data science company listed in 2017, has raised $163 million in funding and is currently valued at $9.11 billion. DataRobot raised $270 million last year, taking its total funding to $751 million, and is currently valued at $2.7 billion.

All of these companies have shown high revenue growth. Alteryx generated quarterly revenue growth of 26% year-on-year, whereas Snowflake grew by 119%. While Databricks is not publicly listed, according to news reports, it grew by 75% in Q3 2020 to $350 million, up from $200 million in Q3 2019.
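The Databricks figure can be checked with the usual year-on-year formula, growth = (current − previous) / previous. A minimal sketch, using only the two revenue figures quoted above (the function name is just for illustration):

```python
def growth_rate(previous, current):
    """Year-on-year growth as a fraction of the earlier figure."""
    return (current - previous) / previous


# Revenue in millions of dollars, as reported in the text.
databricks_growth = growth_rate(200, 350)
print(f"{databricks_growth:.0%}")  # prints 75%
```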

The two listed companies, Alteryx and Snowflake, showed a negative Earnings Per Share (Trailing Twelve Months) of 0.24 and 5.11 last year in September. Alteryx grew its customer base by 27% at the end of the second quarter of 2020, and Snowflake doubled its customer base last year.

Mature industries tend to show slow and steady progress with consistent returns. Hence, even with the high valuations, indicators like the rising revenues, investments, and customer bases show that the data science industry is still in a growth phase. Investment is pouring in, in proportion to the industry’s topicality and its potential applications and services.

The right strategies in investments, along with mergers and acquisitions, play a role in the maturing of industries.

Aligned Strategies

In an interview following the Databricks funding round, Ghodsi said the Series G made a lot of sense with respect to the participants. On closer inspection, Databricks’ funding round amounts to a strategic investment by big cloud players in an open and unified data analytics platform. In other words, giants like AWS and Salesforce are finding niche opportunities in niche markets. Databricks’ acquisition last year of Redash, a data visualisation platform, ties in perfectly with the overall vision.

In the case of Snowflake, the latest round came from Salesforce Ventures. Snowflake is a data cloud firm providing various applications, including data sharing, data science, data warehousing, among others. Salesforce, on the other hand, is the world’s number one CRM solutions provider that needs huge data storage and platforms for data analysis.

Similarly, DataRobot, which provides an end-to-end enterprise AI platform to build value from data, has been funded mostly by tech investors. It acquired Cursor in 2019 to bolster data collaboration.

Wrapping Up

Compared with five years ago, data science firms are experiencing a spurt of growth and are raising funds freely. The strategic collaborations and mergers also suggest the industry is heading toward consolidation. However, the key metrics suggest the industry is in a high-growth phase and is far from being mature.

Databricks bolsters security for data analytics tool

aiuniverse — Mon, 23 Mar 2020 07:27:33 +0000

Source:

One of the biggest challenges with data management and analytics efforts is security.

Databricks, based in San Francisco, is well aware of the data security challenge, and recently updated its Unified Analytics Platform with enhanced security controls to help organizations minimize their data analytics attack surface and reduce risks. Alongside the security enhancements, new administration and automation capabilities make the platform easier to deploy and use, according to the company.

Organizations are embracing cloud-based analytics for the promise of elastic scalability, supporting more end users, and improving data availability, said Mike Leone, a senior analyst at Enterprise Strategy Group. That said, greater scale, more end users and different cloud environments create myriad challenges, with security being one of them, Leone said.

“Our research shows that security is the top disadvantage or drawback to cloud-based analytics today. This is cited by 40% of organizations,” Leone said. “It’s not only smart of Databricks to focus on security, but it’s warranted.”

He added that Databricks is extending foundational security in each environment with consistency across environments, and the vendor is making it easy to proactively simplify administration.

“As organizations turn to the cloud to enable more end users to access more data, they’re finding that security is fundamentally different across cloud providers,” Leone said. “That means it’s more important than ever to ensure security consistency, maintain compliance and provide transparency and control across environments.”

Additionally, Leone said that with its new update, Databricks provides intelligent automation to enable faster ramp-up times and improve productivity across the machine learning lifecycle for all involved personas, including IT, developers, data engineers and data scientists.

Gartner said in its February 2020 Magic Quadrant for Data Science and Machine Learning Platforms that Databricks Unified Analytics Platform has had a relatively low barrier to entry for users with coding backgrounds, but cautioned that “adoption is harder for business analysts and emerging citizen data scientists.”

Bringing Active Directory policies to cloud data management

Data access security is handled differently on-premises compared with how it needs to be handled at scale in the cloud, according to David Meyer, senior vice president of product management at Databricks.

Meyer said the new updates to Databricks enable organizations to more efficiently use their on-premises access control systems, like Microsoft Active Directory, with Databricks in the cloud. A member of an Active Directory group becomes a member of the same policy group with the Databricks platform. Databricks then maps the right policies into the cloud provider as a native cloud identity.
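The two-step mapping Meyer describes, from directory group to platform policy and from policy to a native cloud identity, might look roughly like the sketch below. All group, policy and role names are invented for illustration, and the lookup tables stand in for whatever the platform actually stores.

```python
# Hypothetical mapping from on-premises directory groups to platform policies.
GROUP_TO_POLICY = {
    "ad-data-scientists": "policy-read-curated",
    "ad-data-engineers": "policy-write-raw",
}

# Hypothetical mapping from platform policies to native cloud identities.
POLICY_TO_CLOUD_IDENTITY = {
    "policy-read-curated": "cloud-role/curated-reader",
    "policy-write-raw": "cloud-role/raw-writer",
}


def resolve_identities(user_groups):
    """Map a user's directory groups to cloud identities.

    Groups with no matching policy grant nothing.
    """
    identities = set()
    for group in user_groups:
        policy = GROUP_TO_POLICY.get(group)
        if policy is not None:
            identities.add(POLICY_TO_CLOUD_IDENTITY[policy])
    return sorted(identities)


print(resolve_identities(["ad-data-scientists", "ad-finance"]))
```

Keeping group membership as the single source of truth means a change in the directory flows through to cloud access without editing cloud roles by hand.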

Databricks uses the open source Apache Spark project as a foundational component and provides more capabilities, said Vinay Wagh, director of product at Databricks.

“The idea is, you, as the user, get into our platform, we know who you are, what you can do and what data you’re allowed to touch,” Wagh said. “Then we combine that with our orchestration around how Spark should scale, based on the code you’ve written, and put that into a simple construct.”

Protecting personally identifiable information

Beyond just securing access to data, there is also a need for many organizations to comply with privacy and regulatory compliance policies to protect personally identifiable information (PII).

“In a lot of cases, what we see is customers ingesting terabytes and petabytes of data into the data lake,” Wagh said. “As part of that ingestion, they remove all of the PII data that they can, which is not necessary for analyzing, by either anonymizing or tokenizing data before it lands in the data lake.”
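As an illustration of tokenizing PII before a record lands in the data lake, the sketch below replaces selected fields with keyed-hash tokens. It assumes a hypothetical set of PII field names and a made-up secret key, and it is one possible approach rather than what Databricks customers necessarily do. Keyed hashing is pseudonymization and may not count as full anonymization under a given regulation.

```python
import hashlib
import hmac

# Assumption: which fields count as PII is decided by the ingesting team.
PII_FIELDS = frozenset({"email", "phone"})


def tokenize(value, secret_key):
    """Replace a value with a stable, keyed token (HMAC-SHA256, truncated)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


def scrub_record(record, secret_key, pii_fields=PII_FIELDS):
    """Return a copy of `record` with PII fields tokenized and other fields untouched."""
    return {
        field: tokenize(str(value), secret_key) if field in pii_fields else value
        for field, value in record.items()
    }


raw = {"email": "jane@example.com", "plan": "pro", "phone": "555-0100"}
scrubbed = scrub_record(raw, secret_key=b"demo-key-not-for-production")
print(scrubbed["plan"], scrubbed["email"].startswith("tok_"))
```

Because the same input and key always give the same token, tokenized columns can still be joined and counted without exposing the original values.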

In some cases, though, there is still PII that can get into a data lake. For those cases, Databricks enables administrators to perform queries to selectively identify potential PII data records.
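The article does not say how those queries work. As a rough idea of what flagging potential PII records can involve, the sketch below scans string fields with regular expressions. The two patterns and the sample records are assumptions for illustration, and pattern matching of this kind yields both false positives and false negatives.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_potential_pii(records):
    """Return (record_index, field, kind) for string values matching a PII pattern."""
    hits = []
    for index, record in enumerate(records):
        for field, value in record.items():
            if not isinstance(value, str):
                continue
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    hits.append((index, field, kind))
    return hits


records = [
    {"note": "contact jane@example.com"},
    {"note": "order shipped"},
    {"comment": "ssn 123-45-6789 on file"},
]
hits = find_potential_pii(records)
print(hits)
```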

Improving automation and data management at scale

Another key set of enhancements in the Databricks platform update is for automation and data management.

Meyer explained that historically, each of Databricks’ customers had basically one workspace in which they put all their users. That model doesn’t really let organizations isolate different users, however, or maintain different settings and environments for various groups.

To that end, Databricks now enables customers to have multiple workspaces to better manage and provide capabilities to different groups within the same organization. Going a step further, Databricks now also provides automation for the configuration and management of workspaces.

Delta Lake momentum grows

Looking forward, the most active area within Databricks is the company’s Delta Lake and data lake efforts.

Delta Lake is an open source project started by Databricks and now hosted at the Linux Foundation. The core goal of the project is to enable an open standard around data lake connectivity.

“Almost every big data platform now has a connector to Delta Lake, and just like Spark is a standard, we’re seeing Delta Lake become a standard and we’re putting a lot of energy into making that happen,” Meyer said.

Other data analytics platforms ranked similarly by Gartner include Alteryx, SAS, Tibco Software, Dataiku and IBM. Databricks’ security features appear to be a differentiator.

Technical Enablement Program Manager

aiuniverse — Sat, 27 Jul 2019 13:58:03 +0000

Source: datarobot.com

DataRobot is proudly expanding its enablement operations. We are seeking a technical person who can spearhead enablement efforts in support of our customer-facing engineering team.

In this role you will deeply understand the technologies and processes employed by our customer-facing engineers, and in turn teach, train and enable new hires to ramp up quickly.

The successful candidate will have a working knowledge of the various technologies we use to deploy our platform within client organizations, including but not limited to: Hadoop, Domino, Databricks, Python, R, C# and Java.

In addition to these deep technical skills, a standout candidate will also have a working knowledge of various statistical and data science / machine learning techniques such as: Data preprocessing, Regression, Classification, Clustering.

This role is based in Boston, MA. Supremely qualified candidates from other geographies in the USA are invited to apply for consideration.

About our field engineers:

A field engineer at DataRobot supports both the sales and data science teams during presales and postsales. Pre-sales activities include but are not limited to security, architecture and POC-related discussions and support. Field Engineers usually have named accounts and are responsible for the successful maintenance of our platform.

About You:

Excellent writing and communication skills
Experience installing, managing and supporting platforms running on Linux
Data management/wrangling or software development experience
Working knowledge of enterprise software with experience of supporting its deployment in large enterprise accounts
Bonus: Experience of working with enterprise customers, providing engineering, infrastructure, and technical guidance and support

Additional Bonus points:

Experience with cloud IaaS providers (AWS EC2, Azure)
Hadoop, Spark or BI expertise
Bachelor’s degree in Computer Science or a related field, or equivalent demonstrable experience

Databricks Runtime 5.5 previews Instance Pools

aiuniverse — Thu, 18 Jul 2019 12:29:18 +0000

Source: devclass.com

Databricks, the company behind open source project Apache Spark, has given its Runtime a good old polishing, buffing the version number up to 5.5.

The new Databricks Runtime is, amongst other things, able to use AWS Glue instead of Hive, and R notebooks join Python and Scala on the list of notebooks the product’s Secrets API can inject secrets into.

Version 5.5 also comes with a couple of preview features. One of them is Instance Pools, which lets users hold back some virtual machines which can be used to quickly spin up clusters if needed. While the VMs are idle, only cloud provider costs are incurred with no costs at all if the pool is scaled down to zero instances, according to Databricks.
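The idea behind Instance Pools can be modelled with a small counter: clusters draw on warm, idle machines first and only wait for cold starts when the pool runs dry. The sketch below is a toy model, not the Databricks API; the class, the hourly rate and the cost function are hypothetical.

```python
class InstancePool:
    """Toy model of a pool that keeps idle VMs ready for new clusters."""

    def __init__(self, idle_instances):
        self.idle = idle_instances
        self.in_use = 0

    def acquire(self, count):
        """Take `count` instances; return (warm, cold) where cold ones must be started."""
        warm = min(count, self.idle)
        cold = count - warm
        self.idle -= warm
        self.in_use += count
        return warm, cold

    def release(self, count):
        """Return up to `count` in-use instances to the idle pool."""
        count = min(count, self.in_use)
        self.in_use -= count
        self.idle += count


def idle_cost(idle_instances, provider_rate_per_hour, hours):
    """Cloud-provider cost of holding idle instances; zero when the pool is empty."""
    return idle_instances * provider_rate_per_hour * hours


pool = InstancePool(idle_instances=3)
warm, cold = pool.acquire(5)      # 3 come from the pool, 2 need a cold start
pool.release(2)                   # 2 machines go back to idle
print(warm, cold, pool.idle, idle_cost(pool.idle, 0.5, 4))
```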

Those using the Databricks Runtime on AWS can give querying Delta Lake tables from Presto or Amazon Athena a go, and improve the final version by leaving feedback. The function is realised via manifest files the services can examine instead of going through the directory listing to find files.
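To see why a manifest can be preferable to a directory listing, consider a table directory that still holds a data file which is no longer part of the current table version. The sketch below uses an invented layout (a `manifest.txt` with one path per line, placeholder `.parquet` files) and does not reproduce Delta Lake's actual manifest format.

```python
import os
import tempfile


def files_from_listing(table_dir):
    """Find data files by walking the directory; this picks up stale files too."""
    return sorted(
        os.path.join(root, name)
        for root, _dirs, names in os.walk(table_dir)
        for name in names
        if name.endswith(".parquet")
    )


def files_from_manifest(manifest_path):
    """Read the current data files from a manifest, one path per line."""
    with open(manifest_path, encoding="utf-8") as handle:
        return sorted(line.strip() for line in handle if line.strip())


with tempfile.TemporaryDirectory() as table_dir:
    # Three files on disk; the first is assumed to be superseded.
    for name in ("part-0001.parquet", "part-0002.parquet", "part-0003.parquet"):
        with open(os.path.join(table_dir, name), "w", encoding="utf-8") as handle:
            handle.write("placeholder")

    # The manifest names only the files in the current table version.
    manifest_path = os.path.join(table_dir, "manifest.txt")
    with open(manifest_path, "w", encoding="utf-8") as handle:
        for name in ("part-0002.parquet", "part-0003.parquet"):
            handle.write(os.path.join(table_dir, name) + "\n")

    listed = files_from_listing(table_dir)
    current = files_from_manifest(manifest_path)

print(len(listed), len(current))
```

A reader that trusts the listing would scan three files, while a reader that trusts the manifest scans only the two that belong to the current version.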

A feature only available by contacting support is a new version of the Databricks Filesystem FUSE (Filesystem in userspace) client. The reworked offering is meant to improve performance on all DBFS locations, mounts included, after previous runtime versions already introduced high-performance FUSE storage to dbfs:/ml.

Along with the normal release, there is also a new version of the Runtime for Machine Learning available. Databricks Runtime for ML 5.5 comes with an MLflow 1.0 package added, and upgrades for TensorFlow, PyTorch, and scikit-learn. The ML-specific runtime also saw a HorovodRunner update, giving users a way of distributing their training within a single node, which is meant to make the use of multiple GPUs more efficient.

More adventurous Databricks customers are able to try a preview of a function allowing the recursive loading of files from nested input directories, as well as the Pandas UDF type scalar iterator. The latter can lead to a speedup for some models, since it helps to apply a model to multiple input batches without having to initialise it again and again.
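The speedup from the scalar iterator type comes from a plain programming pattern: initialise the expensive object once and then stream batches through it. The sketch below shows that pattern in pure Python with a stand-in model. It does not use Spark or the Pandas UDF API, and `ExpensiveModel` is hypothetical.

```python
class ExpensiveModel:
    """Stand-in for a model that is costly to load."""

    load_count = 0  # counts how many times a model was initialised

    def __init__(self):
        ExpensiveModel.load_count += 1
        self.scale = 2.0

    def predict(self, batch):
        return [self.scale * x for x in batch]


def predict_per_batch(batch):
    """Scalar style: the model is initialised again on every call."""
    model = ExpensiveModel()
    return model.predict(batch)


def predict_iter(batches):
    """Iterator style: initialise once, then apply the model to every batch."""
    model = ExpensiveModel()
    for batch in batches:
        yield model.predict(batch)


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

ExpensiveModel.load_count = 0
per_batch = [predict_per_batch(b) for b in batches]
loads_per_batch = ExpensiveModel.load_count

ExpensiveModel.load_count = 0
streamed = list(predict_iter(batches))
loads_streamed = ExpensiveModel.load_count

print(loads_per_batch, loads_streamed)
```

Both approaches produce the same predictions; the iterator version simply pays the initialisation cost once instead of once per batch.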

Looking forward, Databricks is planning to drop Python 2 support with the release of Runtime 6.0, which should happen later in 2019. However, there are plans to offer long-term support for the last 5.x release, to make sure there is still a maintained version to run Python 2 code on a little longer if necessary. The step isn’t that surprising, given that that version of the programming language is coming to its end of life next year.

Databricks wants one tool to rule all AI systems – coincidentally, its own MLflow tool

aiuniverse — Sat, 08 Jun 2019 11:09:40 +0000

Source:- theregister.co.uk

Turns out people are not that great at tracking thousands of variables

American upstart Databricks, established by the original authors of the Apache Spark framework, reckons its open-source machine-learning management engine MLflow is ready for prime time.

The released version 1.0 of the platform focuses on core API components. It improves the handling of metrics and search functionality, and adds support for Hadoop as an artifact store, in addition to the previously supported Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP, and NFS.
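Artifact stores of this kind are typically selected by the scheme of a location URI. The sketch below shows that dispatch idea with the standard library only. The scheme table and the example locations are illustrative assumptions, and MLflow's own resolution logic is not reproduced here.

```python
from urllib.parse import urlparse

# Illustrative scheme table; an empty scheme means a plain filesystem path,
# which is how a local disk or an NFS mount would appear.
SCHEME_TO_STORE = {
    "s3": "Amazon S3",
    "wasbs": "Azure Blob Storage",
    "gs": "Google Cloud Storage",
    "sftp": "SFTP",
    "hdfs": "Hadoop (HDFS)",
    "": "local or NFS-mounted path",
}


def artifact_store_for(location):
    """Name the kind of store an artifact location points at, based on its scheme."""
    scheme = urlparse(location).scheme
    if scheme not in SCHEME_TO_STORE:
        raise ValueError(f"unsupported artifact location scheme: {scheme!r}")
    return SCHEME_TO_STORE[scheme]


# Example locations are made up.
print(artifact_store_for("hdfs://namenode:8020/mlflow/artifacts"))
print(artifact_store_for("/mnt/nfs/mlruns"))
```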

It also adds an experimental Open Neural Network Exchange (ONNX) model flavour, and a CLI command for building a Docker image capable of serving an MLflow model.

And finally, there’s Windows support for the MLflow client – in the unlikely event data scientists decide to opt for something other than Linux.

MLflow enables data scientists to track and distribute experiments, package and share models across frameworks, and deploy them – no matter if the target environment is a personal laptop or a cloud data centre.

The company launched the alpha version of the MLflow project last year at the Spark + AI Summit.

Multiple code approaches

The basic machine learning life cycle – taking raw data, preparing it, training your model and deploying it – is full of variables and fraught with complications. It can involve hundreds of different open source tools and frameworks, each with dozens of configurable parameters.

Facebook, Google and Uber have all built their own proprietary tools to deal with this complexity.

MLflow was designed to take some of the pain out of machine learning in organizations that don’t have the coding and engineering muscle of the hyperscalers. It works with every major ML library, algorithm, deployment tool and language.

One of the project’s goals is to improve collaboration between data scientists and the engineers who deploy their creations in production.

In true open source fashion, MLflow users didn’t wait for a stable release to start experimenting: Databricks says the platform has already been deployed at thousands of organizations to manage their machine learning workloads, and the company is offering it as a managed service.

Group effort

Databricks might have started the project, but today, it has more than 100 contributors, including a few from Microsoft.

“People are excited about having an open-source project in this space,” Matei Zaharia, co-founder and chief technologist of Databricks, told El Reg last year.

“They’re excited about having an ML platform – it’s something that resonates with them, and that many wanted to build already – and having one that is a community effort will be much better than what any company could build on its own.”

The next major addition to MLflow will be a Model Registry that allows users to manage their ML model’s lifecycle from experimentation to deployment and monitoring.

AI gets rigorous: Databricks announces MLflow 1.0

aiuniverse — Sat, 08 Jun 2019 10:04:47 +0000

Source:-

MLflow, the open source framework for managing machine learning (ML) experiments and model deployments, has stabilized its API, and reached a version 1.0 milestone, now generally available.

One year ago yesterday, at the 2018 Spark and AI Summit in San Francisco, Matei Zaharia, Databricks’ co-founder/Chief Technologist and creator of Apache Spark, presented his new development focus, an open source project called MLflow. Today, the project has reached a major maturity milestone, with the release of a full version 1.0 to general availability.

ORDER FROM ENTROPY

The data science workflow is, to this day, chock full of ad hoc tasks in siloed development environments. While things are slowly changing, it’s all too common for data scientists to tinker on their laptops, with algorithms and hyperparameter values, until they have a trained ML model that they like, and then manually deploy to production.

MLflow aims to impose rigor on this process, allowing each training iteration to be logged and model deployment, to any number of cloud or private environments, to be automated. This allows the work to be discoverable by other data scientists (which hopefully will avoid them redoing the same work) and for automation of retraining and subsequent redeployment of the model.
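The benefit of logging every training iteration in a shared place can be seen in a few lines: once runs are recorded, a colleague can check whether a configuration has already been tried before repeating it. The sketch below is a toy, in-memory stand-in for a tracking server. `RunLog` and its methods are hypothetical and are not MLflow's API.

```python
class RunLog:
    """Append-only log of training runs (a toy stand-in for a tracking server)."""

    def __init__(self):
        self._runs = []

    def log_run(self, params, metrics):
        """Record one training iteration and return its run id."""
        self._runs.append({"params": dict(params), "metrics": dict(metrics)})
        return len(self._runs) - 1

    def find(self, **params):
        """Return recorded runs whose parameters match all the given values."""
        return [
            run for run in self._runs
            if all(run["params"].get(key) == value for key, value in params.items())
        ]


log = RunLog()
log.log_run({"max_depth": 3, "learning_rate": 0.1}, {"rmse": 0.82})
log.log_run({"max_depth": 5, "learning_rate": 0.1}, {"rmse": 0.74})

# Before training again, check whether someone already tried this configuration.
already_tried = log.find(max_depth=5, learning_rate=0.1)
not_tried = log.find(max_depth=8)
print(len(already_tried), len(not_tried))
```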

V1 NAILS IT DOWN

MLflow allows this work to be done at the command line, through a user interface, or via an application programming interface (API). All three of these interfaces were subject to significant change during MLflow’s first year of development, but with this 1.0 release, developers can rely on these interfaces being stable from here on.

In addition, MLflow 1.0 offers several new features. Although some of these are pretty technically granular, I’ll try to summarize them:

Support for the Hadoop Distributed File System (HDFS) as an “Artifact Store”, allowing MLflow to store its files in on-premises Hadoop clusters, in addition to cloud storage, local disks, Network File System (NFS) storage and Secure FTP
Support for the ONNX (the Open Neural Network eXchange) machine learning model format — originally backed (and used) by Microsoft, Amazon and Facebook — as an MLflow model “flavor”
Improved search features, allowing a SQL-like syntax to be used for filter expressions based on attributes and tags, in addition to metrics and parameters
Support for tracking metric values based on progressions other than time (officially this is referred to as “Support for X Coordinates in the Tracking API”). This is illustrated in the figure at the top of this post, showing how the MLflow UI allows the X axis of its Metrics visualization to be set to Step, in addition to two variants of Time.
Multiple metrics can be logged in “batch,” meaning they can be recorded via a single API call, instead of one call per metric-value pair.
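The last two items, step-based X coordinates and batch logging, can be illustrated together. The sketch below is a toy metric store written for this purpose, not MLflow itself; the class name, method names and values are assumptions.

```python
import time


class MetricStore:
    """Toy metric store that records a step alongside each value."""

    def __init__(self):
        self.points = []      # tuples of (key, value, step, timestamp_ms)
        self.api_calls = 0    # counts calls, to compare single vs. batch logging

    def log_metric(self, key, value, step=0):
        """Record one metric-value pair per call."""
        self.api_calls += 1
        self.points.append((key, value, step, int(time.time() * 1000)))

    def log_batch(self, metrics, step=0):
        """Record several metric-value pairs in a single call."""
        self.api_calls += 1
        now_ms = int(time.time() * 1000)
        for key, value in metrics.items():
            self.points.append((key, value, step, now_ms))

    def history(self, key):
        """Return (step, value) pairs ordered by step, i.e. with step as the X axis."""
        return sorted((step, value) for k, value, step, _ in self.points if k == key)


store = MetricStore()
epochs = [(0.9, 0.61), (0.6, 0.74), (0.4, 0.81)]  # made-up (loss, accuracy) values
for epoch, (loss, accuracy) in enumerate(epochs):
    store.log_batch({"loss": loss, "accuracy": accuracy}, step=epoch)

print(store.api_calls, store.history("loss"))
```

Three epochs with two metrics each cost three calls here, where one call per metric-value pair would have needed six.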

RESPECT AS A STANDARD, WITH MORE IN THE PIPELINE

That’s a nice set of features, and there’s more to come. The MLflow roadmap includes a model registry that can facilitate continuous integration/deployment (CI/CD), model check/code review, as well as insight into the usage and effectiveness of different model versions. There are plans for multi-step workflow support as well.

Databricks says MLflow now has over 100 contributors, and has been deployed at thousands of organizations. Add to that participation from Microsoft and support for MLflow in its Azure Machine Learning platform, and this project looks to have achieved the status of a standard, in a discipline strongly in need of them.
