<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Knime Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/knime/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/knime/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Fri, 16 Oct 2020 06:17:09 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>KNIME And H2O.Ai Accelerate And Simplify End-To-End Data Science Automation</title>
		<link>https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/</link>
					<comments>https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 16 Oct 2020 06:16:52 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Automation]]></category>
		<category><![CDATA[AutoML]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Knime]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[web applications]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=12257</guid>

					<description><![CDATA[<p>Source: aithority.com KNIME and H2O.ai, the two data science pioneers known for their open source platforms, announced a strategic partnership that integrates offerings from both companies. The joint offering combines Driverless <a class="read-more-link" href="https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/">KNIME And H2O.Ai Accelerate And Simplify End-To-End Data Science Automation</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: aithority.com</p>



<p>KNIME and H2O.ai, the two data science pioneers known for their open source platforms, announced a strategic partnership that integrates offerings from both companies. The joint offering combines Driverless AI for AutoML and KNIME Server for workflow management across the entire data science life cycle – from data access to optimization and deployment. With this partnership, KNIME and H2O.ai offer a complete no-code, enterprise data science solution to add value in any industry for end-to-end data science automation.</p>



<p>Preparing data for AI, selecting the right model, pushing it into production, and continuously optimizing it is a process that typically requires many stakeholders and several tools. Parts of it can be automated, but flexibility is paramount in selecting the techniques that best answer a company’s questions. The lack of end-to-end tooling prevalent in most data practices also makes it very difficult to ensure data lineage. This H2O.ai and KNIME integration provides a solution that addresses all of these challenges while increasing data scientists’ productivity, reducing overall IT spend, and producing more accurate predictions.</p>



<p>The expanded integration between H2O.ai and KNIME brings together all-encompassing, intuitive, automated machine learning from H2O.ai with the guided analytics from KNIME.</p>



<p>Customers of H2O.ai and KNIME can now:</p>



<ul class="wp-block-list"><li>Develop an integrated data science workflow in KNIME Analytics Platform and KNIME Server, from data discovery and data preparation to production-ready predictive models.</li><li>Deliver the power of automatic machine learning to business analysts, enabling more citizen data scientists with H2O Driverless AI.</li><li>Reduce model deployment times, leveraging H2O Driverless AI and KNIME Server for reliably managing the workflow and creation process in production.</li></ul>



<p>“We have been using KNIME and H2O Driverless AI for years, and we are very excited about this new integration and the automation and simplification that it will bring to our data science workflow,” said Alejandro Lopez, data science leader of Vision Banco.</p>



<p>“H2O Driverless AI users can now get an integrated data access and preparation platform with KNIME. This allows seamless operationalization and continuous learning demanded by our customers adapting at the speed of change today,” said Sri Ambati, CEO and founder of H2O.ai.</p>



<p>“The integration of Driverless AI offers KNIME users a strong, additional option to automate machine learning out of the box with a huge range of powerful algorithms. We believe that flexibility of choice brings most value to our users and customers, and H2O is a great addition to the mix,” said Michael Berthold, CEO and co-founder of KNIME.</p>



<p>H2O is a leading open source AI platform, and its Driverless AI is a leading automatic machine learning (AutoML) platform. H2O Driverless AI automates time-consuming machine learning workflows with automatic feature engineering, model tuning, and model selection to achieve the highest predictive accuracy within the shortest amount of time. H2O Driverless AI empowers data scientists, statisticians and domain scientists to work on projects faster and more efficiently, using automation to complete in minutes or hours tasks that would otherwise take months. It can now be used within a KNIME workflow.</p>
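<p>As a concept sketch (not the H2O Driverless AI API, which is a commercial product with its own UI and client libraries), the model-selection core of AutoML can be illustrated in plain Python: fit several candidate models, score each on held-out data, and keep the winner. The toy models and data below are invented for illustration.</p>

```python
# Illustrative sketch of the core AutoML loop: candidate models, holdout
# scoring, selection. NOT the Driverless AI API -- just the concept,
# with two toy "models" and mean squared error as the metric.

def mean_model(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def linear_model(xs, ys):
    """Least-squares fit y = a*x + b for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def auto_select(candidates, train, holdout):
    """Fit every candidate, score on held-out data, return the best."""
    fitted = [(name, fit(*train)) for name, fit in candidates]
    scored = [(mse(m, *holdout), name, m) for name, m in fitted]
    return min(scored)  # lowest holdout error wins

train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0])
holdout = ([5, 6], [10.1, 11.8])
score, name, best = auto_select(
    [("mean", mean_model), ("linear", linear_model)], train, holdout)
print(name)  # the linear fit beats the mean baseline on this data
```

<p>Real AutoML systems search far larger spaces (feature transforms, algorithm families, hyperparameters), but the select-by-holdout-score loop is the same shape.</p>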



<p>KNIME Analytics Platform and KNIME Server provide a visual workflow platform for ETL, further machine learning choices, deployment, collaboration, and cloud execution. Users can blend and transform data from hundreds of data sources using a visual, no-code, fully auditable approach. KNIME also offers a wide range of options for how the output can be deployed — from REST to web applications, BI dashboards, and other third-party tools. With Integrated Deployment, teams can automatically and continuously deploy and update models including the process of data access and preprocessing. Driverless AI adds a powerful choice for automating machine learning.</p>
<p>The post <a href="https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/">KNIME And H2O.Ai Accelerate And Simplify End-To-End Data Science Automation</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/knime-and-h2o-ai-accelerate-and-simplify-end-to-end-data-science-automation/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Moving data science into production</title>
		<link>https://www.aiuniverse.xyz/moving-data-science-into-production/</link>
					<comments>https://www.aiuniverse.xyz/moving-data-science-into-production/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Sat, 02 May 2020 11:13:08 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data analytics]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[Knime]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=8521</guid>

					<description><![CDATA[<p>Source: techcentral.ie, 1 May 2020. Deploying data science into production is still a big challenge. Not only does the deployed data science need to be updated <a class="read-more-link" href="https://www.aiuniverse.xyz/moving-data-science-into-production/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/moving-data-science-into-production/">Moving data science into production</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: techcentral.ie</p>



<p>1 May 2020</p>



<p>Deploying data science into production is still a big challenge. Not only does the deployed data science need to be updated frequently but available data sources and types change rapidly, as do the methods available for their analysis. This continuous growth of possibilities makes it very limiting to rely on carefully designed and agreed-upon standards or work solely within the framework of proprietary tools.</p>



<p>KNIME has always focused on delivering an open platform, integrating the latest data science developments by either adding our own extensions or providing wrappers around new data sources and tools. This allows data scientists to access and combine all available data repositories and apply their preferred tools, unlimited by a specific software supplier’s preferences. When using KNIME workflows for production, access to the same data sources and algorithms has always been available, of course. Just like many other tools, however, transitioning from data science creation to data science production involved some intermediate steps.</p>



<p>In this post, we describe a recent addition to the KNIME workflow engine that allows the parts needed for production to be captured directly within the data science creation workflow, making deployment fully automatic while still allowing every module available during data science creation to be used.</p>



<h4 class="wp-block-heading">Why is deploying data science in production so hard?</h4>



<p>At first glance, putting data science in production seems trivial: Just run it on the production server or chosen device! But on closer examination, it becomes clear that what was built during data science creation is not what is being put into production.</p>



<p>I like to compare this to the chef of a Michelin star restaurant who designs recipes in her experimental kitchen. The path to the perfect recipe involves experimenting with new ingredients and optimising parameters: quantities, cooking times, etc. Only when she is satisfied are the final results (the list of ingredients, quantities, and the procedure for preparing the dish) put into writing as a recipe. This recipe is what is moved “into production,” i.e., made available to the millions of cooks at home who bought the book.</p>



<p>This is very similar to coming up with a solution to a data science problem. During data science creation, different data sources are investigated; that data is blended, aggregated, and transformed; then various models (or even combinations of models) with many possible parameter settings are tried out and optimised. What we put into production is not all of that experimentation and parameter/model optimisation — but the combination of chosen data transformations together with the final best (set of) learned models.</p>



<p>This still sounds easy, but this is where the gap is usually biggest. Most tools allow only a subset of possible models to be exported; many even ignore the preprocessing completely. All too often what is exported is not even ready to use but is only a model representation or a library that needs to be consumed or wrapped into yet another tool before it can be put into production. As a result, the data scientists or model operations team needs to add the selected data blending and transformations manually, bundle this with the model library, and wrap all of that into another application so it can be put into production as a ready-to-consume service or application. Lots of details get lost in translation.</p>



<p>For our Michelin chef above, this manual translation is not a huge issue. She only creates or updates recipes every other year and can spend a day translating the results of her experimentation into a recipe that works in a typical kitchen at home. For our data science team, this is a much bigger problem: They want to be able to update models, deploy new tools, and use new data sources whenever needed, which could easily be on a daily or even hourly basis. Adding manual steps in between not only slows this process to a crawl but also adds many additional sources of error.</p>



<p>The diagram below shows how data science creation and productionisation intertwine. This is inspired by the classic CRISP-DM cycle but puts stronger emphasis on the continuous nature of data science deployment and the requirement for constant monitoring, automatic updating, and feedback from the business side for continuous improvements and optimisations. It also distinguishes more clearly between the two different activities: creating data science and putting the resulting data science process into production.</p>



<p>Often, when people talk about “end-to-end data science,” they really only refer to the cycle on the left: an integrated approach covering everything from data ingestion, transforming, and modelling to writing out some sort of a model (with the caveats described above). Actually consuming the model already requires other environments, and when it comes to continued monitoring and updating of the model, the tool landscape becomes even more fragmented. Maintenance and optimisation are, in many cases, very infrequent and heavily manual tasks as well. On a side note: We avoid the term “model ops” purposely here because the data science production process (the part that’s moved into “operations”) consists of much more than just a model.</p>



<h4 class="wp-block-heading">Removing the gap between data science creation and data science production</h4>



<p>Integrated deployment removes the gap between data science creation and data science production by enabling the data scientist to model both creation as well as production within the same environment by capturing the parts of the process that are needed for deployment. As a result, whenever changes are made in data science creation, these changes are automatically reflected in the deployed extract as well. This is conceptually simple but surprisingly difficult in reality.</p>



<p>If the data science environment is a programming or scripting language, then you have to be painfully detailed about creating suitable subroutines for every aspect of the overall process that could be useful for deployment — also making sure that the required parameters are properly passed between the two code bases. In effect, you have to write two programs at the same time, ensuring that all dependencies between the two are always observed. It is easy to miss a little piece of data transformation or a parameter that is needed to properly apply the model.</p>
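<p>The bundling that integrated deployment automates can be sketched in plain Python (the names here are hypothetical, not KNIME's node API): the preprocessing steps and the trained model are captured as one serializable pipeline, so production restores and runs exactly what creation built, with no second code base to keep in sync.</p>

```python
# Concept sketch of "capture for deployment": bundle the preprocessing
# steps and the trained model into ONE artifact, so production applies
# exactly what was used during creation. All names are illustrative.
import pickle

class CapturedPipeline:
    def __init__(self, steps):
        self.steps = steps          # ordered (name, callable) stages

    def run(self, record):
        for _name, step in self.steps:
            record = step(record)
        return record

# "Creation" side: the same transforms used while training...
def fill_missing(record):
    record.setdefault("age", 0)     # impute exactly as during training
    return record

def score(record):                  # stand-in for the trained model
    record["prediction"] = "high" if record["age"] > 40 else "low"
    return record

pipeline = CapturedPipeline([("fill_missing", fill_missing),
                             ("model", score)])

# ...are serialized as one deployable unit, not as a bare model file.
blob = pickle.dumps(pipeline)

# "Production" side: restore and run; nothing re-implemented by hand.
deployed = pickle.loads(blob)
print(deployed.run({"customer": 7}))  # missing age handled identically
```

<p>The point of the sketch is the single artifact: if the imputation rule changes during creation, the deployed pipeline changes with it, instead of drifting apart in a separately maintained scoring program.</p>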



<p>Using a visual data science environment can make this more intuitive. The new Integrated Deployment node extensions from KNIME allow the pieces of the workflow that will also be needed in deployment to be framed, or captured. This is simple because those pieces are naturally part of the creation workflow: the exact same transformation steps are needed during model training, and the models must be evaluated during fine tuning. The following image shows a very simple example of what this looks like in practice:</p>



<p>The purple boxes capture the parts of the data science creation process that are also needed for deployment. Instead of having to copy them or having to go through an explicit “export model” step, now we simply add Capture-Start/Capture-End nodes to frame the relevant pieces and use the Workflow-Combiner to put the pieces together. The resulting, automatically created workflow is shown below:</p>



<p>The Workflow-Writer nodes come in different shapes that are useful for all possible ways of deployment. They do just what their name implies: write out the workflow for someone else to use as a starting point. But more powerful is the ability to use Workflow-Deploy nodes that automatically upload the resulting workflow as a REST service or as an analytical application to KNIME Server or deploy it as a container — all possible by using the appropriate Workflow-Deploy node.</p>



<h4 class="wp-block-heading">A “complete deployment” checklist for data science</h4>



<p>Many data science solutions promise end-to-end data science, complete model-ops, and other flavours of “complete deployment.” Below is a checklist that covers typical limitations.</p>



<ul class="wp-block-list"><li>Can you mix and match technologies (R, Python, Spark, TensorFlow, cloud, on-prem), or are you limited to a particular technology/environment only?</li><li>Can you use the same set of tools during creation as well as the deployment setup, or does one of the two only cover a subset of the other?</li><li>Can you deploy automatically into a service (e.g., REST), an application, or a scheduled job, or is the deployment only a library/model that needs to be embedded elsewhere?</li><li>Is the deployment fully automatic, or are (manual) intermediate steps required?</li><li>Can you roll back automatically to previous versions of both the data science creation process and the models in production?</li><li>Can you run both creation as well as production processes years later with guaranteed backward compatibility of all results?</li><li>Can a revised data science process be deployed in less than one minute?</li></ul>



<p>The purpose of this article is not to describe the technical aspects in great detail. Still, it is important to point out that this capture and deploy mechanism works for all nodes in KNIME — nodes that provide access to native data transformation and modelling techniques as well as nodes that wrap other libraries such as TensorFlow, R, Python, Weka, Spark, and all of the other third-party extensions provided by KNIME, the community, or the partner network.</p>



<p>With the new Integrated Deployment extensions, KNIME workflows turn into a complete data science creation and productionisation environment. Data scientists building workflows to experiment with built-in or wrapped techniques can capture the workflow for direct deployment within that same workflow. For the first time, this enables instantaneous deployment of the complete data science process directly from the environment used to create that process.</p>
<p>The post <a href="https://www.aiuniverse.xyz/moving-data-science-into-production/">Moving data science into production</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/moving-data-science-into-production/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Top 15 Analytical Tools Data Scientists Must Use In 2019</title>
		<link>https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/</link>
					<comments>https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/#comments</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Wed, 29 May 2019 05:51:10 +0000</pubDate>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Analytical Tools]]></category>
		<category><![CDATA[Apache Cassandra]]></category>
		<category><![CDATA[Apache Hadoop]]></category>
		<category><![CDATA[Apache SAMOA]]></category>
		<category><![CDATA[Apache Storm]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[data scientists]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Knime]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3534</guid>

					<description><![CDATA[<p>Source:-analyticsindiamag.com Big data analysts need the right tools which empower them to analyse and make robust decisions in an organisation. In this article, Analytics India Magazine lists down 15 <a class="read-more-link" href="https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/">Top 15 Analytical Tools Data Scientists Must Use In 2019</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:-analyticsindiamag.com</p>
<p>Big data analysts need the right tools to analyse data and make robust decisions in an organisation. In this article, <i>Analytics India Magazine</i> lists 15 top analytical tools that anyone who works with big data should use in 2019:</p>
<h3>1| Apache Spark</h3>
<p>Apache Spark is a fast and general-purpose cluster computing system which provides high-level APIs in Java, Scala, Python, and R, and an optimised engine which supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Some of the features of this unified analytics engine include:</p>
<ul>
<li>Speed: This tool achieves high performance for both batch and streaming data.</li>
<li>Easy to use: It offers over 80 high-level operators which make it easy to build parallel applications.</li>
<li>Generality: It includes a stack of libraries which can be combined seamlessly in the same application.</li>
<li>Flexibility: It runs almost everywhere, including on Hadoop, Apache Mesos, and Kubernetes.</li>
</ul>
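<p>The dataflow that Spark's API generalises can be shown locally with the standard library: a map phase that emits key/value pairs, then a reduce phase that combines values per key. A real cluster job would use pyspark (e.g. <code>rdd.flatMap(...).map(...).reduceByKey(...)</code>), which runs the same shape of computation partitioned across many nodes; the word-count data below is invented for illustration.</p>

```python
# Local, standard-library illustration of the map -> shuffle -> reduce
# pattern that Spark (and Hadoop MapReduce before it) distributes
# across a cluster. Word count is the classic example.
from collections import defaultdict

lines = ["knime and spark", "spark and storm", "spark"]

# map phase: emit a (word, 1) pair for every word in every input line
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle + reduce phase: group by key and sum the counts
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'knime': 1, 'and': 2, 'spark': 3, 'storm': 1}
```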
<h3>2| Apache Storm</h3>
<p>Apache Storm is a free and open source distributed real-time computation system which makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. The features of this analytics tool include:</p>
<ul>
<li>Simple: Storm can be used with any programming language.</li>
<li>Fast: A benchmark clocked it at over a million tuples processed per second per node.</li>
<li>Scalable: Storm is scalable and fault-tolerant, and guarantees your data will be processed.</li>
<li>Easy to use: This tool is easy to set up and operate.</li>
</ul>
<h3>3| Apache SAMOA</h3>
<p>Apache SAMOA is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs).</p>
<p>The features of this analytics tool include</p>
<ul>
<li>SAMOA’s main goal is to help developers easily create machine learning algorithms on top of any distributed stream processing engine.</li>
<li>The users can develop distributed streaming ML algorithms once and execute them on multiple DSPEs.</li>
</ul>
<h3>4| Apache Hadoop</h3>
<p>The Apache Hadoop software library is a framework which allows for the distributed processing of large data sets across clusters of computers using simple programming models. The framework is composed of the following modules</p>
<ul>
<li>Hadoop Common: The common utilities that support the other Hadoop modules.</li>
<li>Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.</li>
<li>Hadoop YARN: A framework for job scheduling and cluster resource management.</li>
<li>Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.</li>
<li>Hadoop Ozone: An object store for Hadoop.</li>
<li>Hadoop Submarine: A machine learning engine for Hadoop.</li>
</ul>
<h3>5| Apache Cassandra</h3>
<p>Apache Cassandra is a distributed database that is highly scalable without compromising performance. Its linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a perfect platform for mission-critical data.</p>
<p>Some of the features of this analytics tool include:</p>
<ul>
<li>Decentralised: There are no single points of failure as every node in the cluster is identical.</li>
<li>Performant: Cassandra <a href="http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf">consistently</a> <a href="http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases">outperforms</a> popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.</li>
<li>Fault Tolerant: Data is automatically replicated to multiple nodes for fault-tolerance.</li>
<li>Durable: Cassandra is suitable for applications that can’t afford to lose data, even when an entire data centre goes down.</li>
</ul>
<h3>6| Elasticsearch</h3>
<p>Elasticsearch is a highly scalable open-source full-text search and analytics engine which allows you to store, search, and analyse big volumes of data quickly and in near real time. It is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. Some of the features of this analytics tool include</p>
<ul>
<li>Query: Elasticsearch lets you perform and combine many types of searches — structured, unstructured, geo, metric — any way you want.</li>
<li>Analyse: Elasticsearch aggregations let you zoom out to explore trends and patterns in your data.</li>
<li>Speed: Elasticsearch is incredibly fast thanks to its implementation of inverted indices with finite state transducers for full-text querying, BKD trees for storing numeric and geo data, and a column store for analytics.</li>
<li>Fast time-to-value: Elasticsearch offers simple REST-based APIs, a simple HTTP interface, and uses schema-free JSON documents, making it easy to get started and quickly build applications for a variety of use-cases.</li>
</ul>
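<p>The "query plus aggregation" combination described above is expressed as a JSON request body sent to the <code>_search</code> endpoint. The sketch below builds one in Python; the index and field names (<code>articles</code>, <code>body</code>, <code>tag</code>) are made up for illustration, and on a running cluster the body would be POSTed to <code>/articles/_search</code>.</p>

```python
# Sketch of an Elasticsearch _search request body combining a full-text
# query with a terms aggregation. Index/field names are hypothetical.
import json

request_body = {
    "query": {                      # full-text search part
        "match": {"body": "data science automation"}
    },
    "aggs": {                       # zoom-out part: top tags across hits
        "top_tags": {
            "terms": {"field": "tag", "size": 5}
        }
    },
    "size": 10,                     # also return the 10 best-scoring hits
}

print(json.dumps(request_body, indent=2))
```

<p>One request therefore returns both the matching documents and the aggregated trend, which is what lets Elasticsearch serve search and analytics from the same index.</p>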
<h3>7| Knime</h3>
<p>KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures. It is an enterprise-grade, open source platform which is fast to deploy, easy to scale, and intuitive to learn. KNIME Analytics Platform is easy to use and it is one of the perfect tools for a data scientist.</p>
<h3>8| Lumify</h3>
<p>Lumify is a powerful big data fusion, analysis, and visualisation platform which supports the development of actionable intelligence. The features of Lumify include:</p>
<ul>
<li>Speed and Scale: Queries run as fast as your underlying database can support, allowing you to take advantage of your existing data infrastructure for data ingest, streaming, complex queries, etc.</li>
<li>Non-Proprietary Data Storage: Lumify sits on top of standard data platforms and fits into your analytic eco-system. Lumify works with your existing data to enable sharing across your analytic tools and systems.</li>
<li>Bring Your Own Analytics Capability: Lumify’s infrastructure allows you to attach new analytic tools that will work in the background to monitor changes and assist analysts as they sort through complex information.</li>
<li>Real-Time and Secure Collaboration: Analysts can instantly share their workspaces with their colleagues, control individual access, and set separate controls based on security classification.</li>
</ul>
<h3>9| MongoDB</h3>
<p>MongoDB is a document database designed for ease of development and scaling, combining scalability with flexibility. It is open source and offers both a Community and an Enterprise version of the database. Some of the features include:</p>
<ul>
<li>MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and data structure can be changed over time.</li>
<li>The document model maps to the objects in your application code, making data easy to work with.</li>
<li>Ad hoc queries, indexing, and real-time aggregation provide powerful ways to access and analyse your data.</li>
<li>MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution are built in and easy to use.</li>
</ul>
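<p>The schema flexibility in the first bullet can be illustrated in plain Python: documents in one collection need not share fields, and an ad hoc query simply filters on whatever fields exist. (A real application would use pymongo against a MongoDB server; the collection and helper below are invented to show the idea.)</p>

```python
# Plain-Python illustration of the document model: records in one
# "collection" vary in shape, and ad hoc queries filter on whatever
# fields are present. Hypothetical data; not the pymongo API.

collection = [
    {"name": "Ada", "languages": ["python", "r"]},
    {"name": "Sam", "languages": ["scala"], "team": "etl"},
    {"name": "Kim"},                      # no "languages" field at all
]

def find(coll, **criteria):
    """Return documents whose fields match every given criterion."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, team="etl"))       # only Sam has a "team" field
```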
<h3>10| Neo4j</h3>
<p>Neo4j is one of the popular graph database management systems. Neo4j’s Graph Platform is the fastest path available to operationalise enterprise analytic insights by connecting the work of big data IT to data scientists to application developers building impactful applications. The Graph Platform fits seamlessly into enterprise data architectures, alongside, around and above relational warehouses, data lakes, cloud and legacy systems.</p>
<h3>11| NodeXL</h3>
<p>NodeXL Basic is a free, open-source template for Microsoft Excel which makes it easy to explore network graphs. NodeXL Pro offers additional features that extend NodeXL Basic, providing easy access to social media network data streams, advanced network metrics, and text and sentiment analysis, and powerful report generation.</p>
<h3>12| R</h3>
<p>R is one of the most popular languages for statistical computing and graphics. It provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.</p>
<h3>13| RapidMiner</h3>
<p>RapidMiner Studio is a powerful data mining tool for rapidly building predictive models. It features hundreds of data preparation and machine learning algorithms to support all your data mining projects. With RapidMiner Studio, you can access, load and analyse any type of data – both traditional structured data and unstructured data like text, images, and media. Some of the features include</p>
<ul>
<li>Easy to use visual environment for building analytics processes</li>
<li>More than 1,500 operators for all tasks of data transformation and analysis</li>
<li>Support for scripting environments like R or Groovy for ultimate extensibility</li>
<li>Seamless access to and use of algorithms from H2O, Weka and other third-party libraries</li>
<li>Extensible through open platform APIs and a Marketplace with additional functionality.</li>
</ul>
<h3>14| Tableau</h3>
<p>Tableau is one of the most popular BI tools for data visualisation. It supports data blending and real-time collaboration, and can connect to files and other big data sources to uncover insights and patterns in data. It is arguably the most powerful, secure, and flexible end-to-end analytics platform for data.</p>
<h3>15| Talend</h3>
<p>Talend is an open source data integration and data management platform with a number of ETL tools designed to simplify the complex needs of a growing, data-driven business. Talend Open Studio for Big Data helps developers build faster with a drag-and-drop UI and pre-built connectors and components.</p>
<p>The post <a href="https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/">Top 15 Analytical Tools Data Scientists Must Use In 2019</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/top-15-analytical-tools-data-scientists-must-use-in-2019/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
