Using AI and Machine Learning to Power Data Fingerprinting
In the era of Big Data, a data catalog is essential for organizations to give users access to the data they need. But it can be difficult to deploy this important tool if populating the catalog entails a significant amount of manual work. The key to a successful data catalog implementation is the rapid population of the data catalog so it can be quickly put into use.
This can be achieved with an automated data fingerprinting process that combines machine learning with human-generated curation, ratings, and reviews.
The #1 Challenge in Successfully Deploying a Data Catalog
One of the biggest challenges in deploying a data catalog is getting it populated with useful information. While many organizations have a business glossary with defined terms and definitions, it is hard to connect this kind of business metadata to the technical metadata, which contains statistics about the data and the actual location where that data lives.
In the data cataloging space, people refer to the ability to “tag” data. This is the act of connecting a physical instance of a column of data with the associated business glossary term. In most organizations, one business term will have tens if not hundreds of physical instances spread across different systems.
For example, “First Name” is a business term that appears in all kinds of systems. The problem is that locating all instances of “first name” is a very tedious task. It is made more difficult because data is often not nicely labeled. A column could be named “First Name”, “fname”, “fn”, “given name” or even “C01”, and all of those columns could contain first names.
This translates into a big problem for implementing a data catalog. How do you connect all of the business terms to the actual implementation of data associated with each term? While some catalogs try to do this by crowdsourcing, practical experience shows that this approach doesn’t scale to the growing amount of data coming into organizations.
Critically, crowdsourcing doesn’t deal with so-called “dark data”, or data with which no one is familiar. New datasets are being created so quickly that it’s not possible to track every instance in an organization. Data can also be unfamiliar because it is new to the organization, for instance if you purchase it from an outside supplier, or because no one has touched it in a long time.
This is why automation is vital to solving the problem of data tagging.
Data Tagging & Fingerprinting Using AI and Machine Learning
Waterline Data addresses the challenge above by using artificial intelligence and machine learning to analyze data and do what we call “data fingerprinting.”
Fingerprinting works on the concept that a column of data has a signature, or a fingerprint. By examining the data values in a column, we can identify what that data is and determine two things: which other columns share the same fingerprint, and which business term or label can be connected to this data.
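To make the idea concrete, here is a minimal sketch in Python. It is not Waterline Data's algorithm, which this article does not describe. It assumes a deliberately simple fingerprint: the distribution of character "shapes" in a column (runs of letters become A, runs of digits become 9), compared with cosine similarity. The function names, the sample values, and the claim number format are all hypothetical.

```python
import math
import re
from collections import Counter


def shape(value):
    """Collapse a value to its character-class pattern, e.g. 'AB-1234' -> 'A-9'."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value.strip())
    return re.sub(r"[0-9]+", "9", collapsed)


def fingerprint(values):
    """Hypothetical fingerprint: relative frequency of each shape in the column."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def similarity(fp_a, fp_b):
    """Cosine similarity between two shape distributions (0.0 to 1.0)."""
    dot = sum(w * fp_b.get(pattern, 0.0) for pattern, w in fp_a.items())
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented sample data: two claim-number columns with different names,
# and one column of first names.
claim_no = ["CLM-2019-00412", "CLM-2020-11873", "CLM-2021-00007"]
c01 = ["CLM-2018-55120", "CLM-2022-00931"]  # poorly labeled column
fname = ["Alice", "Bob", "Carlos"]

sim_same = similarity(fingerprint(claim_no), fingerprint(c01))
sim_diff = similarity(fingerprint(claim_no), fingerprint(fname))
print(sim_same, sim_diff)
```

The two claim columns match even though one is named “C01”, because only the values are examined. A shape-only fingerprint is far too coarse for real use (it cannot tell a first name from a last name, for example), so a practical system would need richer features such as value dictionaries, lengths, and statistical profiles.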
On this second point, connecting a business term to an unlabeled or mislabeled column of data, Waterline Data fingerprinting can do this for many business terms out of the box, but for some terms it has to be trained. Out of the box, for example, it knows what a first name or last name is, or what a credit card number is.
But it doesn’t know what “Claim Number” is for ACME Insurance because the format of a claim number would be unique to ACME. However, once a knowledgeable business user or data steward tags just one column as “Claim Number,” the system now knows what a claim number is. The tag for this business term gets propagated automatically to all of the other unlabeled columns of data that have the same fingerprint.
This is powerful because you only have to tag a unique attribute once, and the system learns and propagates the tag automatically. Curation can even be carried over to a brand-new data source. Suppose a new S3 bucket with terabytes of data has just been brought online. How would you manually sift through it? With Waterline Data, existing fingerprints can be matched automatically against the new body of data.
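The "tag once, propagate everywhere" workflow can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation. It assumes a fingerprint is simply the set of character shapes seen in a column, that fingerprints are compared with Jaccard similarity, and that a tag is suggested when the best match clears a threshold of 0.8. Every name, value, and threshold here is hypothetical.

```python
import re


def shape(value):
    """Collapse letter runs to 'A' and digit runs to '9'."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value.strip())
    return re.sub(r"[0-9]+", "9", collapsed)


def fingerprint(values):
    """Hypothetical fingerprint: the set of shapes that occur in the column."""
    return frozenset(shape(v) for v in values)


def jaccard(a, b):
    """Share of shapes the two fingerprints have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_library(tagged_columns):
    """One fingerprint per business term, learned from a single tagged column."""
    return {term: fingerprint(vals) for term, vals in tagged_columns.items()}


def tag_new_source(new_columns, library, threshold=0.8):
    """Suggest a term for each new column whose best match clears the threshold."""
    suggestions = {}
    for name, values in new_columns.items():
        fp = fingerprint(values)
        term, score = max(
            ((t, jaccard(fp, ref)) for t, ref in library.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if score >= threshold:
            suggestions[name] = term
    return suggestions


# Columns a data steward has already tagged (invented sample values).
library = build_library({
    "Claim Number": ["CLM-2019-00412", "CLM-2020-11873"],
    "Credit Card Number": ["4111 1111 1111 1111", "5500 0000 0000 0004"],
})

# Unlabeled columns from a newly onboarded source.
new_columns = {
    "C01": ["CLM-2018-55120", "CLM-2022-00931"],
    "col_7": ["4012 8888 8888 1881"],
    "notes": ["call back Monday", "n/a"],
}

suggestions = tag_new_source(new_columns, library)
print(suggestions)
# {'C01': 'Claim Number', 'col_7': 'Credit Card Number'}
```

The free-text “notes” column gets no suggestion because it matches nothing in the library. In practice such columns would be left for a person to review, and the suggestions themselves would go through the ratings and reviews step before being accepted.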