Evaluating Recommender Systems - Week 4

Design Evaluation

  • Match evaluation to the question, problem or challenge
  • Which metric matches what you actually care about?

Example: News Recommendation

  • Are we recommending useful articles?
  • Are there ways to improve the recommender?
    • Diversity, serendipity, familiarity, perceived personalisation, balkanization? (see the diversity sketch after this list)
      • Balkanization: Are we dividing people into groups that lose common ground?
  • You can’t design a good recommender without understanding the domain
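
Diversity, for example, can be quantified offline as intra-list diversity: the average pairwise dissimilarity between the items in one recommended list. A minimal sketch, assuming items are represented as feature vectors (the vectors and the cosine-based distance are hypothetical illustrations, not anything defined in these notes):

```python
import numpy as np

def intra_list_diversity(item_vectors):
    """Average pairwise cosine distance between the items in one
    recommendation list. Higher values mean a more diverse list."""
    distances = []
    for i in range(len(item_vectors)):
        for j in range(i + 1, len(item_vectors)):
            a, b = item_vectors[i], item_vectors[j]
            cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            distances.append(1.0 - cosine_sim)
    return float(np.mean(distances)) if distances else 0.0

# Hypothetical topic vectors for three recommended news articles.
recommended = [
    np.array([1.0, 0.0, 0.2]),   # politics-heavy article
    np.array([0.9, 0.1, 0.1]),   # another politics article
    np.array([0.0, 1.0, 0.3]),   # sports article
]
print(intra_list_diversity(recommended))
```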

Example: Medical Recommender

  • What can we do without a live system?
  • You might find that you recommend the best doctors to everyone - but the doctors have a limited number of patients they can meet.
    • “Common problem in the dating space where some users are consistently highly ranked. We can’t tell everyone to go date Tom Hanks because he doesn’t really have the time.”
    • How do we match people with those who are actually available to them, not just with whomever they’d like most? (see the capacity-aware sketch after this list)
    • How does this change the relevant metrics for evaluation?
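
One way to think about the capacity problem is to assign greedily while respecting each doctor's limited slots. A minimal sketch under that assumption (the scores, capacities, and the greedy strategy are illustrative, not the course's method):

```python
from collections import defaultdict

def capacity_aware_match(scores, capacity):
    """Greedy capacity-aware matching: walk through all (patient, doctor)
    pairs from highest to lowest predicted score, and accept a pair only
    while the doctor still has open slots and the patient is unmatched."""
    assigned_to = {}                      # patient -> doctor
    load = defaultdict(int)               # doctor -> patients assigned so far
    for patient, doctor, score in sorted(scores, key=lambda p: p[2], reverse=True):
        if patient in assigned_to or load[doctor] >= capacity[doctor]:
            continue
        assigned_to[patient] = doctor
        load[doctor] += 1
    return assigned_to

# Hypothetical predicted relevance scores and per-doctor capacities.
scores = [
    ("alice", "dr_best", 0.95), ("bob", "dr_best", 0.90),
    ("carol", "dr_best", 0.88), ("bob", "dr_ok", 0.70),
    ("carol", "dr_ok", 0.65),
]
capacity = {"dr_best": 1, "dr_ok": 2}
print(capacity_aware_match(scores, capacity))
# Only one patient gets dr_best; the rest are routed to doctors with open slots.
```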

Example: Music Recommendation

  • Not the same problem - everyone can listen to the same song
  • With music, people often want to listen to things they’ve listened to before, which is different from many other domains.
    • At the same time, you probably don’t want one song on repeat.
  • What is the relative cost of a bad recommendation? A bad song recommendation matters much less than a bad doctor recommendation.
  • Finding the right metric to answer the question you have starts with a deep understanding of the domain.

Evaluating Recommender Systems - Week 3

Online Evaluation and User Studies

  • Basic quantitative metrics might not match the user experience or business objectives that well
  • They can thus be complemented with other user-centric metrics
  • Measuring online with real users also lets you test how users experience recommendations of new items they haven’t rated historically (see the sketch below)
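
A minimal sketch of what such an online comparison might look like, assuming users are split into arms by a stable hash and compared on click-through rate (the hashing scheme and the metric choice are illustrative assumptions):

```python
import hashlib

def assign_arm(user_id, arms=("control", "new_recommender")):
    """Stable, deterministic assignment: hash the user id and map it
    onto one of the experiment arms."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]

def click_through_rate(events):
    """events: list of (user_id, clicked) tuples from the recommender log."""
    clicks = sum(1 for _, clicked in events if clicked)
    return clicks / len(events) if events else 0.0

# Hypothetical log: which users saw recommendations and whether they clicked.
log = [("u1", True), ("u2", False), ("u3", True), ("u4", False), ("u5", True)]
by_arm = {"control": [], "new_recommender": []}
for user_id, clicked in log:
    by_arm[assign_arm(user_id)].append((user_id, clicked))

for arm, events in by_arm.items():
    print(arm, round(click_through_rate(events), 3), f"({len(events)} users)")
```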

Evaluating Recommender Systems - Week 2

Serendipity

Evaluation approaches that hold out some data share a common flaw: they treat the recommender system as if it were just a machine-learning problem.

We don’t want to recover what the person already knows but hasn’t yet told us - we want to recommend things the user didn’t already know about. Looking at historical data about things the user has already shown interest in is therefore inherently limiting, and it biases us away from helping the user actually discover something serendipitous.

Likewise, in the commerce domain: rather than recommending things the customer would have bought anyway, what we really want is to drive uplift by helping them discover new things they wouldn’t otherwise have bought.

If we evaluate our system on how well it recreates a held-out dataset from the past, it will never get credit for recommending something better than what the user actually found on their own.
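
To make the flaw concrete, here is a minimal sketch of that hold-out style of evaluation: hide part of each user's known history, then score the recommender on how many hidden items it recovers. A recommendation the user would have loved but never found on their own cannot improve this score. The data and the hit-rate metric below are hypothetical illustrations:

```python
import random

def holdout_split(user_items, holdout_fraction=0.2, seed=0):
    """Hide a fraction of each user's known items; the rest is 'training' data."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        cut = max(1, int(len(items) * holdout_fraction))
        hidden[user] = set(items[:cut])
        train[user] = set(items[cut:])
    return train, hidden

def hit_rate(recommendations, hidden):
    """Fraction of users for whom at least one recommended item was hidden.
    Recommending something genuinely new (not in the user's history at all)
    can never increase this number - which is exactly the flaw in the text."""
    hits = sum(1 for user, recs in recommendations.items()
               if set(recs) & hidden.get(user, set()))
    return hits / len(recommendations) if recommendations else 0.0

# Hypothetical interaction histories and recommendations.
history = {"u1": ["a", "b", "c", "d", "e"], "u2": ["b", "c", "f", "g"]}
train, hidden = holdout_split(history)
recs = {"u1": ["d", "z"], "u2": ["x", "y"]}   # 'z', 'x', 'y' are novel items
print(hit_rate(recs, hidden))
```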


Evaluating Recommender Systems - Week 1

  • Metrics
    • Accuracy, Decision support, Rank, others (see the sketch after this list)
  • Evaluating without users
    • Evaluating offline data
    • Framework for hidden-data evaluation
  • Evaluation with users
    • Lab and Field Experiments (A/B Trials)
    • User surveys, log analysis
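
A minimal sketch of the three metric families from the list above, using hypothetical data: RMSE for accuracy, precision@k for decision support, and nDCG for rank.

```python
import math

def rmse(predicted, actual):
    """Accuracy: root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(ranked_items, relevant, k):
    """Decision support: share of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k

def ndcg_at_k(ranked_items, relevant, k):
    """Rank: discounted cumulative gain of the top-k list, normalised by the
    best possible ordering (binary relevance for simplicity)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical predictions and a ranked list for one user.
print(rmse([4.1, 3.0, 5.0], [4.0, 2.5, 4.5]))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))
print(ndcg_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))
```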