Source – forbes.com
In part one of this two-part series, we covered the complexity of the digital identity problem and some early-market solutions. Read on for part two below!
Machine Learning In Digital Identity
Two broad categories of machine learning are unsupervised learning, which includes clustering, and supervised learning, which includes classification. Each has its pros and cons, and the advertising ecosystem has predominantly embraced clustering to solve the problem of identity.
Simply put, clustering is the partitioning of data into groups, where items in the same group are more closely related to one another than to items in other groups. With this approach, similar consumer activities are grouped based on associations between data fields. Fields like IP address, user agent, location, time and content consumed are often used to do this.
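As a rough illustration of the grouping idea, here is a minimal Python sketch that links identifiers seen on the same IP address and treats each connected group as one cluster. The event data, the identifier names and the function cluster_by_shared_ip are all hypothetical, and linking on a single field is a deliberate simplification; a production system would likely weigh several fields together.

```python
from collections import defaultdict

# Hypothetical ad events as (identifier, ip) pairs. All values are made up.
EVENTS = [
    ("cookie_a", "203.0.113.7"),
    ("device_b", "203.0.113.7"),
    ("cookie_c", "198.51.100.4"),
    ("device_d", "198.51.100.4"),
    ("cookie_e", "192.0.2.99"),
]


def cluster_by_shared_ip(events):
    """Group identifiers that were seen on the same IP (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen_on_ip = {}
    for identifier, ip in events:
        find(identifier)  # register the identifier even if it never links
        if ip in first_seen_on_ip:
            union(identifier, first_seen_on_ip[ip])
        else:
            first_seen_on_ip[ip] = identifier

    groups = defaultdict(set)
    for identifier in parent:
        groups[find(identifier)].add(identifier)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_by_shared_ip(EVENTS)
print(clusters)
```

Because the only linking signal here is the IP address, the sketch inherits that field's weaknesses: if an address is reassigned, unrelated identifiers end up in the same group.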
Even though the approach is scientific and in some cases highly complicated, it is extremely difficult to overcome the flaws of the individual data fields themselves. For instance, an IP address is inherently unreliable over the long term because it can change. Identifiers clustered in this model, particularly cookies, are largely unreliable.
Though clustering solves part of the cross-device identity problem, it falls short for a large portion of the device and consumer population.
Classification, on the other hand, is the process of predicting which label a new observation belongs to, based on a fact-based training data set of observations whose labels are already known. This method can be used to generate a predictive model that associates multiple advertising identifiers with one consumer or household.
Using models based on classification for identity resolution involves managing a statistically relevant training set, which has observations from a group of devices known to be linked to the same consumer and/or household.
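To make the classification idea concrete, here is a small Python sketch that labels a pair of identifiers as "same household" or "not" by majority vote among its nearest labeled pairs (k-nearest neighbors). The two features, the training rows and the function predict_same_household are assumptions made for illustration only; they are not drawn from any real data set or from a specific vendor's model.

```python
import math

# Hypothetical labeled training pairs. Each row describes a PAIR of identifiers:
#   (share of sessions on a common IP, share of sessions in the same city)
# Label 1 = known to belong to the same household, 0 = known not to.
TRAINING = [
    ((0.90, 1.00), 1),
    ((0.80, 0.90), 1),
    ((0.70, 1.00), 1),
    ((0.10, 0.30), 0),
    ((0.00, 0.10), 0),
    ((0.20, 0.00), 0),
]


def predict_same_household(features, training=TRAINING, k=3):
    """Classify a new pair by majority vote among its k nearest labeled pairs."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], features))[:k]
    votes = sum(label for _features, label in nearest)
    return 1 if votes * 2 > k else 0


likely_linked = predict_same_household((0.85, 0.95))
likely_unlinked = predict_same_household((0.05, 0.20))
print(likely_linked, likely_unlinked)
```

The prediction is only as good as the labeled rows behind it, which is why the size and quality of the training set matter so much for this approach.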
Building a thorough training set is easier said than done. Imagine collecting training data from your favorite travel website. A consumer might log into the site from one or two separate devices to research vacation ideas and deals, but it’s highly likely that the same consumer will also visit anonymously to get quick info like airport information or in-flight movie choices. If that consumer only travels twice each year, the data isn’t collected frequently enough to be included in a thorough training set. Classification models tend to work better for identity resolution when the thoroughness and freshness of the training data can be maintained at all times.
Hybrid Solution For Maximum Accuracy
Not surprisingly, the most effective digital identity solution comes from utilizing a hybrid of these two models. Semi-supervised learning leverages advantages from both clustering and classification algorithms to achieve higher accuracy. At Qualia, we do this by utilizing both algorithms and leveraging a combination of billions of signals from devices and users’ intent-driven activities to create statistically relevant models that can enhance or validate each other.
In semi-supervised learning, a small amount of labeled data is used along with unlabeled data as the training data set for the model. With this approach, the quality, freshness and statistical relevance of the labeled data can be managed, because a small labeled data set is far less costly and resource-intensive to maintain.
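One common way to combine a small labeled set with unlabeled data is self-training, sketched below in Python. This is an illustrative stand-in, not a description of any particular company's system: the feature pairs, the margin threshold and the function self_train are all assumptions. The model assigns a label to an unlabeled point only when it is clearly closer to one class centroid than to the other, then recomputes the centroids and repeats.

```python
import math

# Hypothetical features for identifier pairs: (shared-IP rate, same-city rate).
LABELED = {(0.9, 1.0): 1, (0.1, 0.1): 0}  # small, carefully maintained labeled set
UNLABELED = [(0.8, 0.9), (0.7, 0.8), (0.2, 0.2), (0.3, 0.1), (0.5, 0.5)]


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def self_train(labeled, unlabeled, margin=0.3, max_rounds=10):
    """Pseudo-label confident points round by round (assumes exactly two classes)."""
    labeled = dict(labeled)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        centers = {
            label: centroid([p for p, l in labeled.items() if l == label])
            for label in set(labeled.values())
        }
        still_unsure = []
        for point in remaining:
            ranked = sorted((math.dist(point, c), label) for label, c in centers.items())
            (best_dist, best_label), (next_dist, _other) = ranked[0], ranked[1]
            if next_dist - best_dist >= margin:
                labeled[point] = best_label  # confident enough: adopt as pseudo-label
            else:
                still_unsure.append(point)   # too ambiguous: leave unlabeled
        if len(still_unsure) == len(remaining):
            break  # no progress in this round
        remaining = still_unsure
    return labeled, remaining


final_labels, unresolved = self_train(LABELED, UNLABELED)
print(final_labels, unresolved)
```

In this toy data the ambiguous point (0.5, 0.5) stays unlabeled, which illustrates a design choice: leaving uncertain cases unresolved is often safer than forcing a link.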
Another advantage of this model is that household identification can be performed using classification techniques with labeled data across household devices, and those household devices can then be clustered into individual groups using significantly smaller labeled data sets. An approach like this handles changes in unlabeled data much better than using classification or clustering alone.
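The two-stage idea can be sketched in Python as follows. The household column is assumed to come from an upstream classifier that is not shown, and the choice of "typical active hour" as the clustering feature, the max_gap threshold and the function split_household are hypothetical. Within each household, devices whose typical hours fall close together are grouped as likely belonging to the same individual.

```python
from collections import defaultdict

# (device, household assigned by a hypothetical upstream classifier,
#  typical active hour of day). All values are made up.
DEVICES = [
    ("phone_1", "house_A", 7),
    ("laptop_1", "house_A", 8),
    ("tablet_1", "house_A", 21),
    ("phone_2", "house_B", 13),
]


def split_household(devices, max_gap=2):
    """Within each household, group devices whose active hours are close.

    Simplification: hours are treated as a line, so 23:00 and 01:00 are
    considered far apart.
    """
    by_household = defaultdict(list)
    for device, household, hour in devices:
        by_household[household].append((hour, device))

    result = {}
    for household, members in by_household.items():
        members.sort()
        groups = [[members[0][1]]]
        for (prev_hour, _prev), (hour, device) in zip(members, members[1:]):
            if hour - prev_hour <= max_gap:
                groups[-1].append(device)  # close in time: same individual group
            else:
                groups.append([device])    # large gap: start a new group
        result[household] = groups
    return result


individuals = split_household(DEVICES)
print(individuals)
```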
In short, digital identity management is a complex problem, and there is no one-model-fits-all approach. However, it is one of the most interesting challenges in the advertising ecosystem, encouraging data scientists to innovate and develop creative solutions for an ever-changing device landscape.