Matrix Factorization and Advanced Techniques - Week 5

Learning Recommenders

  • ; parameters and/or recommendation model
  • ; error or utility computing predictions or recommendations with model .
    • When using utility, maximize instead of minimize.
  • An optimization algorithm is used to find the best parameters (e.g. gradient descent or expectation maximization)

Interview with Xavier Amatriain

For Netflix, recommendations are considered more important than search. Search is what happens when recommendations fail. Most videos watched comes from people following recommendations, over explicit search.

They work with rows and rankings, i.e. a row is a category and the order of movies within the row is the ranking. So recommendations works on two levels; order of categories and order of videos within the row.

The basic idea of their recommendations is to optimize the amount of hours that people engage with the service as that corresponds to actual business objectives.

Their recommendations look at implicit signals, not explicit ratings, as this corresponds better to actual behavior. This is also because users over time have begun giving less explicit feedback. A further complication is that accounts are oftentimes shared by multiple persons in the household - which also means looking at the current context and implicit feedback works better.

When starting a ranking problem, popularity is a good start. Then the question is how to start adding personalization into the mix. A linear model combining these scores is a good start.

Learning the weights for the two parts can be done with logistic regression, treating it as a classification problem trying to classify the items the user watched correctly.

There are a lot of more advanced way to try to better learn the ranking of items that closer corresponds to the objectives you might have (learning to rank, list wise approaches etc…), but these are oftentimes not differentiable so will require more complex approaches.

Interview with Daniel Kluver

  • Bayesian Personalized Ranking (BPR)
  • Goal: Personalized ranking function
  • Most popular learning to rank method

BPR is not an algorithm in itself but a loss function, and some suggestions for optimization approach. You will need to implement an algorithm with all part as .

The loss function is BPR-OPT, which approximates an AUC that is optimized with stochastic gradient descent.

A standard model is to use matrix factorization, so that;

AUC and ROC curves

  1. Train a model
  2. Specify a set of “good items”
  3. For each recommendation length; 3.1 Compute % of good items recommended 3.2 Compute % of bad items recommended 3.3 Plot that point

Plotting this across all points creates the ROC curve of True positive vs. false positive rates. Optimizing then becomes a question about maximizing the AUC (i.e. the true positive rate).

The AUC can also be constructed as the percentage of correctly ranked points, where each point D is a ranking between two items.

  • ; we know u preferred i over j
  • ; equals one if foo is true, 0 if otherwise

The one function is hard to optimize as it’s not smooth; changing the scoring function slightly can either have no or a radical effect on the AUC curve.

  • The AUC depends on how well the algorithm ranks items
  • Can not be optimized by standard techniques
  • We need a well behaved approximation

This is where BPR-OPT comes in. Instead of assigning scores , let’s think of it as assigning probabilities to the ranking function. The goal then becomes to learn the function .

BPR uses a logistic function so that;

The natural way to evaluate a likelihood functionality is the log likelihood. This helps with numeric stability as large data sets generates a lot of very small probabilities that can’t be handled well by the computer. Logging it transforms it into a summation rather than product, and so it fits better into computer memory.

Since we are looking at optimizing AUC(D), and is static, we can instead optimize;

This means that using our approximation function, we can then differentiate it and optimize using gradient descent.

Data Preparation

BPR does not train over ratings, it trains over user-item-item triples. Preparing the data though, you need to decide on how to handle ambiguities like;

  • One item the user has seen (clicked, purchased) - one the user has not
  • One the user rated, one the user has not

Lastly, the data order matters for the training velocity. Random order works better than training points taken by user or item, which is why stochastic gradient descent is used.

Lastly

Most other learning to rank algorithms follow a similar structure;

  1. Choose a rank metric
  2. Identify parts that are not smooth
  3. Replace with “close enough” approximations
  4. Learn with gradient descent

Comments