Evaluating Recommender Systems - Week 2

Serendipity

Evaluation methods that hold out some data share a common flaw: they treat recommendation as if it were just a machine learning problem.

We don’t want to recover what the person already knows but hasn’t yet told us - we want to recommend things the user didn’t already know about. Evaluating against historic data of things the user has already shown interest in is thus by definition limiting, and it biases us away from helping the user actually discover something serendipitous.

Likewise, in the commerce domain: rather than recommending things the customer would have bought anyway, what we really want is to drive uplift by helping them discover new things they wouldn’t otherwise have bought.

If we evaluate our system on how well it recreates a historic hold-out dataset, we can never recommend anything better than what the user actually found in the past.

Limitations of Predictive Accuracy

Predicting ratings from the past doesn’t necessarily correlate with generating good recommendations for the future.

The implicit assumption is that the users’ past choices were perfect, since that is what we evaluate the recommendations against. But if that were the case - why bother building a recommender system at all?

Further, once your system is live the data becomes biased, since everything collected after launch is heavily influenced by what you have already recommended to the user.

Still, predictive accuracy has practical advantages:

  • It’s easy to measure quantitatively
  • Since it works on past data, you can evaluate multiple algorithms against big historic data sets

User Experience

  • Once you have identified some good candidate algorithms by evaluating against historic data, you should start evaluating them with actual users to see their live impact.
  • User-facing metrics go beyond predicting hold-out sets and can include considerations such as serendipity, diversity between items, and whether multiple items in the list fill the same role and thus effectively compete with each other.
  • Optimising towards a metric only makes sense once we’ve established that the metric approximates the user experience (or whatever other objective you have) well.

Additional Item and List-Based Metrics

Metrics apart from accuracy that might more closely match business goals or user experience:

  • Coverage
  • Popularity / Novelty
  • Personalization
  • Serendipity
  • Diversity

Coverage

  • Coverage measures the percentage of products for which a recommender can make a prediction (for every user/item pair)
    • Or a personalized prediction (as opposed to a fallback default)
    • Or prediction above a certain confidence threshold
  • Computed as a simple percentage (see the sketch after this list)
    • Be careful to average consistently over users/unrated items
    • Easiest is to ‘hide’ every item and compute over the entire dataset
  • An extension is to measure what share of items end up in someone’s top-N list
    • Items that we can’t recommend to anyone seem like a waste to have in inventory…
  • Differences in coverage also impact how well you can compare algorithms to each other
    • High accuracy but low coverage vs low accuracy but high coverage
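A minimal sketch of both coverage variants, assuming a hypothetical `predict(user, item)` function that returns `None` when no (personalized) prediction can be made, and a dict mapping each user to their top-N list:

```python
from itertools import product

def prediction_coverage(predict, users, items):
    """Share of user/item pairs for which the recommender can produce a score.

    `predict(user, item)` is assumed to return None when it cannot make a
    (personalized) prediction.
    """
    pairs = list(product(users, items))
    predictable = sum(1 for u, i in pairs if predict(u, i) is not None)
    return predictable / len(pairs)

def catalog_coverage(top_n_lists, items):
    """Share of catalog items that show up in at least one user's top-N list."""
    recommended = {item for top_n in top_n_lists.values() for item in top_n}
    return len(recommended & set(items)) / len(items)
```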

Popularity/Novelty

  • Simple metric that can be applied to recommendations
    • Popularity can be measured as the percentage of users who buy/rate an item, or taken from external sources
    • Novelty is often expressed as the inverse of popularity, but could also be expressed as the likelihood of familiarity (see the sketch below)
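As a rough sketch (not prescribed by the lecture), novelty-as-inverse-popularity can be computed as the self-information `-log2(popularity)` of each recommended item, averaged over the list; the data layout and function names are illustrative:

```python
import math
from collections import defaultdict

def item_popularity(interactions, n_users):
    """popularity(i) = share of users who have interacted with item i."""
    users_per_item = defaultdict(set)
    for user, item in interactions:   # interactions: iterable of (user, item) pairs
        users_per_item[item].add(user)
    return {item: len(users) / n_users for item, users in users_per_item.items()}

def mean_novelty(top_n, popularity):
    """Novelty as the inverse of popularity, here via self-information -log2(p)."""
    return sum(-math.log2(popularity[item]) for item in top_n) / len(top_n)
```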

Personalization

  • How to measure
    • Variance of predictions per item across users
    • Average difference in top-N lists among users (see the sketch after this list)
      • Related coverage metrics
    • User-perceived personalization (survey)
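A small sketch of the “average difference in top-N lists” idea, using 1 minus Jaccard overlap as the difference measure (one reasonable choice among several; the dict layout is assumed):

```python
from itertools import combinations

def personalization(top_n_lists):
    """Average pairwise dissimilarity (1 - Jaccard overlap) of users' top-N lists.

    top_n_lists: dict mapping user -> list of recommended item ids.
    Returns 0 when every user gets the same list, approaching 1 when lists are disjoint.
    """
    lists = [set(items) for items in top_n_lists.values()]
    dissimilarities = [1 - len(a & b) / len(a | b)
                       for a, b in combinations(lists, 2)]
    return sum(dissimilarities) / len(dissimilarities)
```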

Serendipity

  • Definition: “The occurrence and development of events by chance in a happy or beneficial way”
  • In recommender systems; surprise, delight, not the expected results
  • Several ways to estimate it
    • Difference between the predicted score for the user and how popular/well-known the item is, multiplied by a relevance score (see the sketch after this list)
    • Key concept - you need a prior “primitive” estimate of obviousness, oftentimes popularity
      • It can also be how far the item is from things I’ve recently consumed, since things similar to those are what I’m already expecting recommendations for
  • Don’t need an overall metric to increase serendipity
    • Simply downgrade items that are highly popular
    • This tends to require experimentation and tuning
  • Business goal - keep people interested
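A hedged sketch of the “unexpectedness times relevance” estimate mentioned above, assuming you already have per-user predicted scores, a popularity-based expectedness score per item, and a set of items the user is known to like:

```python
def serendipity(recommended, predicted, expectedness, relevant):
    """Sketch of the 'unexpectedness times relevance' estimate.

    recommended : the user's top-N list of item ids
    predicted   : dict item -> predicted score for this user
    expectedness: dict item -> prior "obviousness" estimate (e.g. scaled popularity),
                  assumed to be on the same scale as the predicted scores
    relevant    : set of items the user is known to actually like
    """
    scores = [max(0.0, predicted[i] - expectedness[i]) * (1.0 if i in relevant else 0.0)
              for i in recommended]
    return sum(scores) / len(scores)
```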

Diversity

  • How different are the items in a list?
    • Applies to a top-N list
  • Start with a pairwise similarity metric
  • Intra-list similarity is the average pairwise similarity; a lower score means higher diversity
  • A common approach is to penalize/remove items from the top-N list that are too similar to items already recommended earlier in the list (never touch #1) - see the sketch after this list
    • Alternatively, cluster items in bundles and then just select the top item(s) from each bundle
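A sketch of intra-list similarity and a greedy re-ranking pass, assuming a `sim(a, b)` function that returns pairwise similarity in [0, 1]; the threshold value is illustrative:

```python
from itertools import combinations

def intra_list_similarity(items, sim):
    """Average pairwise similarity of a top-N list; lower means more diverse."""
    pairs = list(combinations(items, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def diversify(ranked_items, sim, threshold=0.8):
    """Greedy re-rank: keep #1, then drop items too similar to anything already kept."""
    kept = [ranked_items[0]]
    for item in ranked_items[1:]:
        if all(sim(item, prev) < threshold for prev in kept):
            kept.append(item)
    return kept
```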

Business Objectives and Metrics

  • Experimental data shows that users are often willing to give up some accuracy in exchange for improvements on the metrics mentioned above
    • You need to find the right balance in each context
  • Businesses usually have broader sets of objectives for the recommender system than what has been outlined above, for example:
    • Immediate Lift
    • Net Lift (subtract out cost of returns)
    • Time to next transaction
    • Long-term customer value (lifetime value)
    • Referrals

Experimental Protocols

Offline validation

Using hold-out sets and cross-validation, get metrics on how good a system is at recreating actual history. In RS research, 5 or 10 splits are commonly used.

Splitting Data

Data can be split by ratings, users and/or items (theoretically), depending on what you want to test/recommend for.

Splitting user data

When splitting by users, you generally also use part of the test users’ data in training, since those user profiles are what will be used to predict the held-out data in the end.

You can also split data by time to capture the temporal dimension: some purchases are likely to suggest future purchases, but the relationship is not bidirectional, so you are not interested in learning that a purchase predicts an earlier, historic one.

Good practice

  • Split users into k partitions (e.g. 5)
  • Split user ratings by time
    • Can use random as well for comparison
  • Include user query ratings in train data (see the sketch after this list)
  • Document your protocol carefully
    • So you can run it again
    • So others can compare
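A minimal sketch of this protocol, assuming ratings arrive as `(user, item, rating, timestamp)` tuples; the fold count, hold-out fraction, and all names are illustrative:

```python
import random
from collections import defaultdict

def split_protocol(ratings, k=5, holdout_frac=0.2, seed=42):
    """Partition users into k folds; for held-out users, split their ratings by
    time so the earlier ('query') ratings stay in training and the most recent
    ones form the test set. Yields one (train, test) pair per fold.
    """
    by_user = defaultdict(list)
    for r in ratings:                      # r = (user, item, rating, timestamp)
        by_user[r[0]].append(r)

    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    folds = [users[i::k] for i in range(k)]

    for test_users in folds:
        train, test = [], []
        held_out = set(test_users)
        for user, user_ratings in by_user.items():
            if user not in held_out:
                train.extend(user_ratings)
                continue
            user_ratings = sorted(user_ratings, key=lambda r: r[3])  # by time
            cut = int(len(user_ratings) * (1 - holdout_frac))
            train.extend(user_ratings[:cut])   # query ratings stay in training
            test.extend(user_ratings[cut:])    # most recent ratings are evaluated
        yield train, test
```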

Unary Data Evaluation and Top-N challenges

Some particular considerations for unary data. Unfortunately, there are still some deep unsolved problems.

Missing Data

In most RS settings, most of the data is missing. With a rating system, we at least know when a user has rated an item low. With unary data such as clicks, purchases or other log data, though, it’s usually not clear whether data is missing because the user is unaware of the item or because they don’t like it.

If it’s an item the user doesn’t know about yet but would actually like - that would indeed be a good and serendipitous recommendation!

Popularity Bias

  • Recommenders that are effective at finding relevant but less-popular items perform poorly on these offline metrics
  • Existing data of course will always favour currently popular items

Solution 1: Rank Effectiveness

  1. Ask the recommender to rank the test items, rather than the entire item universe
  2. Measure whether the ranking is consistent with the user’s ratings
    • MAP
    • nDCG
    • nDPM

This requires ratings, or at least negative ratings, so it is not well suited to unary data. Instrumenting to capture implicit or explicit negative ratings is possible, but requires a lot of work. (nDCG is sketched below.)
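For concreteness, a small sketch of nDCG over the ranked test items, using held-out ratings as gains (the data layout is assumed, not prescribed by the lecture):

```python
import math

def ndcg(ranked_items, true_ratings, k=10):
    """Normalized discounted cumulative gain of the recommender's ranking of the
    test items, using the user's held-out ratings as gains.

    ranked_items: the test items, in the order the recommender ranks them
    true_ratings: dict item -> held-out rating for this user
    """
    def dcg(items):
        return sum(true_ratings.get(item, 0.0) / math.log2(pos + 2)
                   for pos, item in enumerate(items[:k]))

    ideal = sorted(ranked_items, key=lambda i: true_ratings.get(i, 0.0), reverse=True)
    best = dcg(ideal)
    return dcg(ranked_items) / best if best > 0 else 0.0
```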

Solution 2: Limit Domain

  • Limit candidate set for recommendation
    • Recommend from test + N randomly-selected unknown items
  • The intuition is that good-but-unknown items are unlikely to be included in the random sample - so the actual test item should rank high
  • However, randomly-selected items are likely to be unpopular, which means the test item (which itself is likely quite popular) gets an even further boost from its popularity (a sketch of the protocol follows below)
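A sketch of this protocol as a per-user hit-rate check, assuming a hypothetical `score(user, item)` function; the sample size and cutoff are illustrative:

```python
import random

def hit_in_limited_domain(score, user, test_item, unknown_items, n=100, k=10, seed=0):
    """Rank the held-out test item against n randomly sampled items the user has
    not interacted with; count a hit if the test item lands in the top k.

    score(user, item) is assumed to return the recommender's predicted score.
    """
    candidates = random.Random(seed).sample(list(unknown_items), n) + [test_item]
    ranked = sorted(candidates, key=lambda item: score(user, item), reverse=True)
    return test_item in ranked[:k]
```

Averaging this over all test users gives a hit rate that sidesteps ranking the entire catalog, at the cost of the popularity boost noted above.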

Temporal Evaluation

The notion of time is important in recommender systems. Having a ‘popular’ section that never changes will soon lose the user’s interest.

Similarly, evaluations can be tracked over time, for example by measuring diversity and novelty as they evolve.
