IMPACT OF DEEP LEARNING ON PERSONALIZATION
Machine learning-based personalization has gained traction over the years due to the sheer volume of data across sources and the velocity at which consumers and organizations generate new data. Traditional approaches to personalization focused on deriving business rules using techniques like segmentation, which often did not address each customer uniquely. Recent progress in specialized hardware (GPUs and cloud computing) and a burgeoning set of ML and DL toolkits enables us to develop 1:1 customer personalization that scales.
Recommender systems benefit both service providers and users. They reduce the transaction costs of finding and selecting items in an online shopping environment and improve the customer experience. Recommender systems have also been shown to improve the quality of decision making. In an e-commerce setting, for example, recommender systems enhance revenues because they are an effective means of selling more products. In scientific libraries, recommender systems support users by allowing them to move beyond catalog searches. The need for efficient and accurate recommendation techniques that provide relevant and dependable recommendations for users therefore cannot be over-emphasized.
At Epsilon, we have used machine learning to generate granular product recommendations across a wide range of channels, driving customer engagement and the bottom line. The usual approach to product recommendation involves creating multiple models for high-level product categories. Such solutions are not scalable, are resource-intensive, and do not make truly personalized recommendations. In the sections below, we briefly touch upon methods for building effective recommender systems.
Collaborative filtering recommends items by identifying other users with similar tastes; it uses their opinions to recommend items to the active user. These recommenders learn from user-item interaction data, from which we can create an N x M sparse matrix capturing all possible user-item interactions, where N is the number of users and M is the number of items. This data is usually very sparse, meaning the matrix contains very few non-zero elements. This is also evident from the long-tailed distribution visible in the interaction frequency plot across items.
The sparse matrix representation of the data helps in real-world use cases since it stores only the non-zero elements, which are relatively few. Collaborative recommenders are used by companies such as Netflix, where user ratings are either explicitly obtained from customers or implicitly derived from user behavior.
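To make the sparse representation concrete, here is a minimal sketch that stores user-item interactions as COO-style (user, item, rating) triplets, keeping only the non-zero entries. The users, items, and ratings are made-up illustrative values, not real data.

```python
# Toy user-item interactions as (user, item, rating) triplets -- a COO-style
# sparse representation that stores only the non-zero entries.
interactions = [
    ("u1", "itemA", 5), ("u1", "itemC", 3),
    ("u2", "itemB", 4),
    ("u3", "itemA", 2), ("u3", "itemB", 1), ("u3", "itemC", 4),
]

users = sorted({u for u, _, _ in interactions})
items = sorted({i for _, i, _ in interactions})

# Density = stored non-zeros / (N x M); real retail matrices are far sparser.
density = len(interactions) / (len(users) * len(items))
print(f"{len(users)} users x {len(items)} items, density = {density:.2f}")
```

In production one would typically use a library class such as `scipy.sparse.coo_matrix` rather than raw triplets, but the storage idea is the same.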
There are many different algorithms that solve the collaborative recommendation problem. Broadly, they can be classified as:
- Memory-based similarity measures
- Matrix Factorization – e.g. SLIM, WARP, Spark ALS, Funk SVD, SVD++
- Neural Net based – these have the potential to capture even non-linear relations, e.g. using Embeddings, Variational Auto-Encoders, Reinforcement Learning, etc.
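To illustrate the matrix factorization family, here is a minimal Funk-SVD-style sketch trained with stochastic gradient descent on a handful of explicit ratings. All data and hyperparameters are arbitrary toy values; a real system would use a library implementation such as Spark ALS.

```python
import random

# Funk-SVD-style matrix factorization: learn user factors P and item factors Q
# so that the dot product P[u] . Q[i] approximates the observed rating r_ui.
random.seed(0)
ratings = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0, (2, 0): 2.0, (2, 2): 4.0}
n_users, n_items, k = 3, 3, 2  # k = number of latent factors
P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
lr, reg = 0.05, 0.02  # learning rate and L2 regularization strength

def predict(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(k))

for epoch in range(500):
    for (u, i), r in ratings.items():
        err = r - predict(u, i)
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)  # gradient step on user factor
            Q[i][f] += lr * (err * pu - reg * qi)  # gradient step on item factor

# Mean squared error over the observed entries shrinks as training proceeds.
mse = sum((r - predict(u, i)) ** 2 for (u, i), r in ratings.items()) / len(ratings)
```

Unobserved cells of the matrix can then be scored with `predict(u, i)` and sorted to produce top-N recommendations per user.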
Content-based recommenders leverage the metadata available for items and users and try to match items to the user's taste. Item metadata consists of the essential attributes that describe an item, while user metadata describes the characteristics of individual users, e.g. demographics. Using past user-item interaction data together with these attributes, profiles are created for each item and user, and similarity matching is then applied to find the top N recommendations.
The distance calculation can use many different metrics, but cosine similarity is used most often. If the input data is already normalized, a simple linear kernel (dot product) can be used instead of cosine similarity. Another common metric is Euclidean distance, but it may not be suitable for recommenders when many one-hot-encoded dummy variables are involved.
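The similarity-matching step above can be sketched in a few lines. The user profile and item attribute vectors below are made-up illustrative weights; in practice they would come from the item and user metadata.

```python
import math

def cosine(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|); on unit-normalized vectors
    # this reduces to the plain dot product (linear kernel).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

user_profile = [0.9, 0.1, 0.4]                      # hypothetical taste vector
items = {"itemA": [1.0, 0.0, 0.5], "itemB": [0.0, 1.0, 0.2]}

# Rank items by similarity to the user profile to get top-N recommendations.
ranked = sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True)
```

Here `itemA` ranks above `itemB` because its attribute vector points in nearly the same direction as the user profile, which is exactly what cosine similarity measures independently of vector magnitude.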
Naïve Bayes is a simple machine learning algorithm based on Bayes' theorem of conditional probability. It is "naïve" because it assumes that all predictor features are mutually independent. This assumption allows it to make fast predictions and lends it scalability, although independence between features may not hold in many real-world cases. It is commonly used as a baseline method in text classification problems.
In recommender systems, naïve Bayes can be used to predict the likelihood of purchasing a product conditioned on past purchases. The output scores can be sorted in descending order and the top N products recommended. It is fast, scalable, and performs well with categorical predictors.
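A from-scratch sketch of this idea, assuming binary indicators of past purchases as features; the training rows below are an invented toy example, not client data.

```python
import math

# Each row: ((bought_A, bought_B), bought_target) -- binary purchase indicators.
data = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),
    ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
]

def fit(data, n_features=2, alpha=1.0):
    counts = {0: 0, 1: 0}
    feat = {0: [0] * n_features, 1: [0] * n_features}
    for x, y in data:
        counts[y] += 1
        for j, v in enumerate(x):
            feat[y][j] += v
    priors = {c: counts[c] / len(data) for c in counts}
    # Laplace smoothing (alpha) avoids zero probabilities for unseen patterns.
    cond = {c: [(feat[c][j] + alpha) / (counts[c] + 2 * alpha)
                for j in range(n_features)] for c in counts}
    return priors, cond

def log_score(x, priors, cond, c):
    # log P(c) + sum_j log P(x_j | c), relying on the independence assumption.
    s = math.log(priors[c])
    for j, v in enumerate(x):
        p = cond[c][j]
        s += math.log(p if v else 1 - p)
    return s

priors, cond = fit(data)
new_customer = (1, 1)
buys = log_score(new_customer, priors, cond, 1) > log_score(new_customer, priors, cond, 0)
```

In a recommender, one such score would be computed per candidate product and the scores sorted in descending order to pick the top N.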
Sequence modeling is the task of predicting the next item(s) in a sequence given the earlier items. The term is most often used in the context of RNNs and LSTMs in Natural Language Processing, but similar concepts apply to other domains as well, e.g. stock prediction, likelihood to buy a product, etc.
A simple RNN is illustrated in the figure below. As can be seen, the output from the RNN is fed back as an input to it; the same network is shown in unrolled form in the figure (right). X0, X1, X2, ..., Xt are the inputs at different time steps, and h0, h1, h2, ..., ht are the hidden states.
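The unrolling in the figure can be sketched as a loop that reuses one set of weights at every time step. The weights below are illustrative constants, not trained values, and the hidden state has just two units to keep things readable.

```python
import math

# Elman-style RNN step: h_t = tanh(Wx * x_t + Wh @ h_{t-1} + b)
Wx = [0.5, -0.3]                 # input-to-hidden weights (scalar input)
Wh = [[0.8, 0.1], [-0.2, 0.7]]   # hidden-to-hidden weights
b = [0.0, 0.1]                   # hidden biases

def step(x, h):
    return [math.tanh(Wx[j] * x + sum(Wh[j][k] * h[k] for k in range(2)) + b[j])
            for j in range(2)]

h = [0.0, 0.0]                        # initial hidden state
for t, x in enumerate([1.0, 0.5, -1.0]):  # inputs x0, x1, x2
    h = step(x, h)                    # the SAME weights are reused at each step
# h now holds the final hidden state; a full model would feed it to an
# output layer to produce the prediction.
```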
There are different configurations that can be used in sequence modeling, as illustrated in the figure below.
- One to One – no recurrence is needed; the input and output are of fixed size, e.g. image classification
- One to Many – e.g. image captioning, where a sequence of words is generated to describe the image
- Many to One – e.g. sentiment classification: given an input text, its sentiment is classified as positive or negative
- Many to Many (1) – e.g. neural machine translation and text summarization: given a sequence of text as input, another sequence of text is generated as output
- Many to Many (2) – e.g. video frame classification: each video frame is classified
When the sequences to be modeled are long, simple RNNs suffer severely from the vanishing gradient problem. In such cases, it is better to use a modified RNN architecture called the Long Short-Term Memory (LSTM) network. An LSTM has special gates – forget, input (update), and output – which help it learn from much longer sequences.
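The gate mechanics can be sketched for a single LSTM step with scalar hidden and cell states. The weights are illustrative, untrained numbers; the point is the structure of the gates, in particular the additive cell-state update that eases gradient flow over long sequences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])       # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])       # input/update gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])       # output gate
    c_tilde = math.tanh(w["wc"] * x + w["uc"] * h_prev + w["bc"])  # candidate state
    c = f * c_prev + i * c_tilde   # additive update: old memory kept or erased
    h = o * math.tanh(c)           # hidden state exposed to the next layer
    return h, c

# Arbitrary illustrative weights, one scalar per gate parameter.
w = dict(zip(
    ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wc", "uc", "bc"],
    [0.5, 0.4, 0.1, 0.6, 0.3, 0.0, 0.7, 0.2, 0.0, 0.9, 0.5, 0.0]))

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.8]:  # a short input sequence
    h, c = lstm_step(x, h, c, w)
```

When the forget gate `f` saturates near 1, the cell state `c` passes through steps almost unchanged, which is what lets gradients survive over long sequences.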
To predict the likelihood of purchase of a product given the prior purchase sequence, we can model the problem using a many-to-one architecture. At Epsilon, we have successfully built such models for several clients, who have seen a significant lift in purchase rates across different products.
In the retail industry, a customer does not buy a product only once or a few times. Many retail products, e.g. FMCG items, home décor, apparel, etc., are bought again and again after some time gap. So, if we know the prior purchase pattern for each product for each customer, we can model it as a sequence and predict the likelihood of purchase of each product in the upcoming weeks. The beauty of this solution is that it captures the time aspect of customers' purchase patterns and can recommend products for repeat sell or cross-sell depending on the prior purchase sequence. The same concept can also be used for demand forecasting at the store and product level using multivariate sequential inputs.
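One way to feed purchase gaps into such a model is to bucket each customer's purchase dates for a product into a fixed-length weekly binary sequence. The dates, window length, and function name below are hypothetical choices for illustration.

```python
from datetime import date

def weekly_sequence(purchase_dates, start, n_weeks):
    # 1 in week w if the product was bought in that week, else 0 -- a
    # many-to-one model would consume this sequence to score next-week purchase.
    seq = [0] * n_weeks
    for d in purchase_dates:
        week = (d - start).days // 7
        if 0 <= week < n_weeks:
            seq[week] = 1
    return seq

# A customer buying the same product roughly every four weeks:
history = [date(2023, 1, 2), date(2023, 1, 30), date(2023, 2, 27)]
seq = weekly_sequence(history, start=date(2023, 1, 2), n_weeks=10)
print(seq)  # -> [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```

The regular spacing of the 1s is exactly the time-gap signal that sequence models can exploit but order-agnostic methods cannot.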
Contrary to sequence modeling, other approaches like collaborative filtering, content-based filtering, naïve Bayes, etc. cannot capture the time aspect of retail purchases. Looking at associations between past purchases can help us predict what products a customer might purchase, but it cannot be more specific by factoring in the order of prior purchases. Deep learning-based architectures allow us to include user and item metadata as well in a single model, yielding more refined models that better predict the next set of items a customer is going to purchase.
Recommender systems solve a key problem faced by marketers: uniquely addressing each customer with the right product and creative content. This post described how an organization can use its existing transactional data to drive personalization.