Analytical Tools Archives - Artificial Intelligence

Bridging the gap between databases and data science

aiuniverse — Tue, 09 Jun 2020 06:44:46 +0000

Source: universiteitleiden.nl

Relational databases store data in a way that preserves the relations between the data. This property makes them a useful tool for data scientists. There is, however, a gap between the relational database research community and data scientists, which leads to inefficient use of databases in data science. PhD student Mark Raasveldt tried to bridge the gap between relational databases and data science. PhD defense 9 June 2020.

Integration with analytical tools

Most data scientists use analytical tools, such as R, Python and C/C++, for their research. These tools are difficult to integrate with current database systems, resulting in slow and cumbersome data analysis. ‘Data scientists have opted to reinvent database systems by developing a zoo of data management alternatives that perform similar tasks to classical database management systems, but have many of the problems that were solved in the database field decades ago,’ says Raasveldt.

‘The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing.’ Raasveldt tried to combine these innovations in database research with the analytical tools most used by data scientists. ‘We investigate how we can facilitate efficient and painless integration of analytical tools and relational database management systems,’ says Raasveldt.

Large datasets

Another issue with the use of standard database systems in computer science is the size of the data that is handled. Most database systems are not optimized for large data sets and large-scale data analysis using remote servers. To optimize the database systems, there are three methods that can be considered.

‘We focus our investigation on the three primary methods for database-client integration: client-server connections, in-database processing and embedding the database inside the client application,’ Raasveldt explains. For every method, he studied the implementations in existing database systems and he evaluated how efficient they are for the large datasets and workloads that are common in data science.
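
The third method, embedding the database inside the client application, can be illustrated with a minimal sketch. It uses SQLite through Python's standard sqlite3 module as a stand-in, only because that module ships with Python; it is not DuckDB, and the table name, column names and function name are hypothetical.

```python
import sqlite3


def average_by_group(rows):
    """Load (group, value) rows into an in-process database and aggregate with SQL."""
    # ":memory:" creates the database inside this Python process:
    # there is no server to start and no network connection to cross.
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
        con.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        cur = con.execute(
            "SELECT grp, AVG(value) FROM measurements GROUP BY grp ORDER BY grp"
        )
        return cur.fetchall()
    finally:
        con.close()


rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
print(average_by_group(rows))
```

With an embedded database the query engine runs in the same process as the analysis code. A client-server setup would instead send the query to, and the results back from, a separate database process.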

DuckDB

Raasveldt's final result was a new data management system, called DuckDB, that was purpose-built for efficient and painless integration with R and Python (and other analytical tools). It is meant to be a mature database system, not one that is used only for research purposes.

‘In DuckDB we take all the lessons that we have learned investigating database-client integrations and create an easy-to-use and highly efficient embedded database.’ Raasveldt will continue his work as a postdoc at the CWI, where he will work on further developing DuckDB.

The post Bridging the gap between databases and data science appeared first on Artificial Intelligence.

Analytical model predicts exactly how much a piece of hardware will speed up data centers

aiuniverse — Wed, 08 Apr 2020 09:03:33 +0000

Source: techxplore.com

Large-scale software services fight the efficiency battle on two fronts—efficient software that is flexible to changing consumer demands, and efficient hardware that can keep these massive services running quickly even in the face of diminishing returns from CPUs. Together, these factors determine both the quality of the user experience and the performance, cost, and energy efficiency of modern data centers.

A change on one front requires adjustments on the other, and a new software architecture growing in popularity has posed a challenge to the hardware solutions current in most data centers. Called microservices, this modular approach to designing big enterprise software has left something to be desired in its interactions with another major rising force in datacenter efficiency, hardware accelerators.

To bring these two promising technologies together more effectively, CSE Ph.D. student Akshitha Sriraman, working with researchers from Facebook, has designed a way to measure exactly how much a hardware accelerator would speed up a datacenter. Appropriately named Accelerometer, the analytical model can be applied in the early stages of an accelerator’s design to predict its effectiveness before ever being installed.

Hardware accelerators are still a somewhat new technology in general computing, and their effectiveness isn’t as easy to predict as that of CPUs, which have decades of experience behind them. Investing in this sort of diverse custom hardware presents a risk at scale, since it might not live up to expectations.

But the potential for a big impact is there. Designed to perform one type of function extremely quickly, accelerators could theoretically be called upon for all the redundant, repetitive tasks used in common by bigger applications.

That includes microservices. This software architecture approach conceives of a larger application as a collection of modular, task-specific services that can each be improved upon in isolation. This allows for changes to be made to the larger application without needing to change one huge, central codebase. It also allows for more services to be added more easily.

Sriraman demonstrated that as few as 18% of most microservices’ CPU cycles are spent executing instructions that are core to their functionality. The remaining 82% are spent on common operations that are ripe for accelerating.

“Accelerating these overheads we identified can indeed improve speedup to a significant extent,” Sriraman says. Beyond speed, it would make all of the datacenter’s functions cheaper and more energy efficient. “Acceleration will allow us to pack more work for the same power constraints and improve resource utilization at scale, so data center energy and cost savings will improve greatly.”

The issue with microservices is that their designs can turn out to be quite dissimilar, particularly with regard to how they interact with hardware. For example, a microservice can communicate with an accelerator while continuing to run other instructions on a CPU, or it could bring all of its functions to a halt while it offloads to the accelerator. Both of these cases face different “offload overheads” (the time spent sending a task from one processor to another), which becomes lost time for the datacenter if it’s not accounted for.

“Each of these software design choices can result in different overheads that affect the overall speedup from acceleration,” says Sriraman. This overhead is left out of the picture in prior work, she continues, as is the impact of the different microservice designs on performance.

Additionally, accelerators themselves have to be used judiciously to have a net positive effect.

“Throwing an accelerator at every problem is ridiculous because it takes a lot of time, cost, and effort to build, test, and deploy each one,” she concludes. “There is a real need to precisely understand what and how to accelerate.”

Accelerometer is an analytical model that measures exactly how much performance would be improved by installing a given processor, if at all, with all of these nuances taken into account. That means it measures the positive effect of acceleration as well as the negative effect of spending time shuffling instructions around between computing components. And its capabilities aren’t limited to new accelerators—the model can be applied to any kind of hardware, ranging from a simple CPU optimization to an extremely specialized remote ASIC.
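
The trade-off between acceleration gains and offload overhead can be sketched with a simple Amdahl-style estimate. This is not the Accelerometer model itself, which the article does not spell out. The function name, its parameters and the two offload modes are illustrative assumptions. Of the example numbers, only the 82% figure comes from the article; the 10x factor and the 2% overhead are made up.

```python
def estimated_speedup(accelerable_fraction, accel_factor, offload_overhead,
                      synchronous=True):
    """Amdahl-style estimate of end-to-end speedup (illustrative only).

    All times are fractions of the original, unaccelerated run time.
    """
    if not 0.0 <= accelerable_fraction <= 1.0:
        raise ValueError("accelerable_fraction must be between 0 and 1")
    if accel_factor <= 0 or offload_overhead < 0:
        raise ValueError("accel_factor must be > 0 and offload_overhead >= 0")

    cpu_time = 1.0 - accelerable_fraction              # work that stays on the CPU
    accel_time = accelerable_fraction / accel_factor   # offloaded work, now faster

    if synchronous:
        # The CPU halts while the accelerator runs, so every cost adds up.
        total = cpu_time + offload_overhead + accel_time
    else:
        # The CPU keeps working during the offload; the slower side sets the pace.
        total = max(cpu_time + offload_overhead, accel_time)
    return 1.0 / total


# Hypothetical accelerator: 10x faster on 82% of the cycles, 2% offload overhead.
sync = estimated_speedup(0.82, 10.0, 0.02, synchronous=True)
overlapped = estimated_speedup(0.82, 10.0, 0.02, synchronous=False)
print(f"synchronous: {sync:.2f}x, overlapped: {overlapped:.2f}x")
```

Even in this toy version, a large enough overhead pushes the estimate below 1x, which is the case in which an accelerator makes things slower.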

The tool was validated in Facebook’s production environment using three retrospective case studies, demonstrating that its real speedup estimates have less than 3.7% error.

The model is sufficiently accurate to already be put to use by Facebook, with early interest from other companies.

“We have received word that several of the big cloud players have started using Accelerometer to quickly discard bad accelerator choices and identify the good ones, to make well-informed hardware investments,” Sriraman says. Facebook is using the model to explore new accelerators, incorporating it as a first step to quickly sort out good and bad hardware choices.


Top 15 Analytical Tools Data Scientists Must Use In 2019

aiuniverse — Wed, 29 May 2019 05:51:10 +0000

Source: analyticsindiamag.com

Big data analysts need the right tools to analyse data and make robust decisions in an organisation. In this article, Analytics India Magazine lists 15 top analytical tools that everyone who works with Big Data must use in 2019:

1| Apache Spark

Apache Spark is a fast and general-purpose cluster computing system which provides high-level APIs in Java, Scala, Python, and R, and an optimised engine which supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Some of the features of this unified analytics engine include

Speed: This tool achieves high performance for both batch and streaming data.
Easy to use: It offers over 80 high-level operators that make it easy to build parallel applications.
Generality: It includes a stack of libraries which can be combined seamlessly in the same application.
Flexible: It runs almost everywhere, including on Hadoop, Apache Mesos and Kubernetes.

2| Apache Storm

Apache Storm is a free and open source distributed real-time computation system which makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. The features of this analytics tool include

Simple: Storm is simple and can be used with any programming language.
Fast: A benchmark clocked it at over a million tuples processed per second per node
Scalable: It is scalable, fault-tolerant and guarantees your data will be processed.
Easy to use: This tool is easy to set up and operate.

3| Apache SAMOA

Apache SAMOA is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs).

The features of this analytics tool include

SAMOA’s main goal is to help developers easily create machine learning algorithms on top of any distributed stream processing engine.
The users can develop distributed streaming ML algorithms once and execute them on multiple DSPEs.

4| Apache Hadoop

The Apache Hadoop software library is a framework which allows for the distributed processing of large data sets across clusters of computers using simple programming models. The framework is composed of the following modules

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop Ozone: An object store for Hadoop.
Hadoop Submarine: A machine learning engine for Hadoop.
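
The "simple programming model" behind Hadoop MapReduce can be illustrated with a word count written in plain Python. This is a single-process sketch of the map, shuffle and reduce steps, not the Hadoop API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data tools", "big data streams"]))
```

In a real cluster the map and reduce calls run in parallel on many machines and the framework performs the shuffle. The sketch only shows how the three steps fit together.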

5| Apache Cassandra

Apache Cassandra is a distributed database which is highly scalable without compromising performance. It is a perfect platform for mission-critical data as it has features such as linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure.

Some of the features of this analytics tool include

Decentralised: There are no single points of failure as every node in the cluster is identical.
Performant: Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.
Fault Tolerant: Data is automatically replicated to multiple nodes for fault-tolerance.
Durable: Cassandra is suitable for applications that can’t afford to lose data, even when an entire data centre goes down.

6| Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine which allows you to store, search, and analyse big volumes of data quickly and in near real time. It is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. Some of the features of this analytics tool include

Query: Elasticsearch lets you perform and combine many types of searches — structured, unstructured, geo, metric — any way you want.
Analyse: Elasticsearch aggregations let you zoom out to explore trends and patterns in your data.
Speed: Elasticsearch is incredibly fast due to the implementation of inverted indices with finite state transducers for full-text querying, BKD trees for storing numeric and geodata, and a column store for analytics.
Fast time-to-value: Elasticsearch offers simple REST-based APIs, a simple HTTP interface, and uses schema-free JSON documents, making it easy to get started and quickly build applications for a variety of use-cases.
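
The inverted index mentioned under "Speed" can be sketched in a few lines of Python. This is a toy illustration of the data structure only. The real implementation adds finite state transducers, text analysis and relevance scoring, and the function names here are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "fast full-text search",
    2: "fast column store",
    3: "search and analytics",
}
index = build_inverted_index(docs)
print(search(index, "fast search"))
```

A lookup touches only the entries for the query terms instead of scanning every document, which is why this structure suits full-text search.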

7| Knime

KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures. It is an enterprise-grade, open source platform which is fast to deploy, easy to scale, and intuitive to learn. KNIME Analytics Platform is easy to use and well suited to data scientists.

8| Lumify

Lumify is a powerful big data fusion, analysis, and visualisation platform which supports the development of actionable intelligence. The features of Lumify include

Speed and Scale: Queries run as fast as your underlying database can support, allowing you to take advantage of your existing data infrastructure for data ingest, streaming, complex queries, etc.
Non-Proprietary Data Storage: Lumify sits on top of standard data platforms and fits into your analytic eco-system. Lumify works with your existing data to enable sharing across your analytic tools and systems.
Bring Your Own Analytics Capability: Lumify’s infrastructure allows you to attach new analytic tools that will work in the background to monitor changes and assist analysts as they sort through complex information.
Real-Time and Secure Collaboration: Analysts can instantly share their workspaces with their colleagues, control individual access, and set separate controls based on security classification.

9| MongoDB

MongoDB is a scalable, flexible document database designed for ease of development and scaling. It is open source and offers both a Community and an Enterprise version of the database. Some of the features include

MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and data structure can be changed over time.
The document model maps to the objects in your application code, making data easy to work with.
Ad hoc queries, indexing, and real-time aggregation provide powerful ways to access and analyse your data.
MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution are built in and easy to use.
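
The document model described above can be illustrated with plain Python dictionaries. This sketch does not use MongoDB or its drivers; the sample records and the find helper are hypothetical.

```python
import json

# Two documents in the same collection with different fields.
documents = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Linus", "languages": ["C", "Shell"]},
]


def find(docs, **criteria):
    """Ad hoc query: return documents whose fields equal every criterion."""
    return [d for d in docs
            if all(d.get(key) == value for key, value in criteria.items())]


# The structure can change over time: add a field to one document only.
documents[0]["active"] = True

print(json.dumps(find(documents, name="Ada")))
```

Because each document carries its own fields, adding "active" to one record needs no schema change for the others.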

10| Neo4j

Neo4j is a popular graph database management system. Neo4j’s Graph Platform is the fastest path available to operationalise enterprise analytic insights by connecting the work of big data IT, data scientists, and application developers building impactful applications. The Graph Platform fits seamlessly into enterprise data architectures, alongside, around and above relational warehouses, data lakes, cloud and legacy systems.

11| NodeXL

NodeXL Basic is a free, open-source template for Microsoft Excel which makes it easy to explore network graphs. NodeXL Pro offers additional features that extend NodeXL Basic, providing easy access to social media network data streams, advanced network metrics, text and sentiment analysis, and powerful report generation.

12| R

R is one of the most popular languages for statistical computing and graphics. It provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible.

13| RapidMiner

RapidMiner Studio is a powerful data mining tool for rapidly building predictive models. It features hundreds of data preparation and machine learning algorithms to support all your data mining projects. With RapidMiner Studio, you can access, load and analyse any type of data – both traditional structured data and unstructured data like text, images, and media. Some of the features include

Easy to use visual environment for building analytics processes
More than 1,500 operators for all tasks of data transformation and analysis
Support for scripting environments like R or Groovy for ultimate extensibility
Seamless access to and use of algorithms from H2O, Weka and other third-party libraries
Extensible through open platform APIs and a Marketplace with additional functionality.

14| Tableau

Tableau is one of the most popular BI tools for data visualisation. The tool supports data blending and real-time collaboration, among other features, and can connect to files and other Big Data sources in order to gain insights and patterns from data. It has been called the most powerful, secure, and flexible end-to-end analytics platform for your data.

15| Talend

Talend is an open source data integration and data management platform with a number of ETL tools designed to simplify the complex needs of a growing, data-driven business. Talend Open Studio for Big Data helps in developing faster with a drag-and-drop UI and pre-built connectors and components.
