Evaluating Recommender Systems - Week 4

Design Evaluation

  • Match evaluation to the question, problem or challenge
  • Which metric matches what you actually care about?

Example: News Recommendation

  • Are we recommending useful articles?
  • Are there ways to improve the recommender?
    • Diversity, serendipity, familiarity, perceived personalisation, balkanization? (see the diversity sketch after this list)
      • Balkanization: Are we dividing people into groups that lose common ground?
  • You can’t design a good recommender without understanding the domain
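
Diversity, for example, can be quantified offline as intra-list diversity: the average pairwise dissimilarity between the items in one recommended list. A minimal sketch, assuming items are represented as feature vectors (the vectors and the cosine-based distance are hypothetical illustrations, not anything defined in these notes):

```python
import numpy as np

def intra_list_diversity(item_vectors):
    """Average pairwise cosine distance between the items in one
    recommendation list. Higher values mean a more diverse list."""
    distances = []
    for i in range(len(item_vectors)):
        for j in range(i + 1, len(item_vectors)):
            a, b = item_vectors[i], item_vectors[j]
            cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            distances.append(1.0 - cosine_sim)
    return float(np.mean(distances)) if distances else 0.0

# Hypothetical topic vectors for three recommended news articles.
recommended = [
    np.array([1.0, 0.0, 0.2]),   # politics-heavy article
    np.array([0.9, 0.1, 0.1]),   # another politics article
    np.array([0.0, 1.0, 0.3]),   # sports article
]
print(intra_list_diversity(recommended))
```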

Example: Medical Recommender

  • What can we do without a live system?
  • You might find that you recommend the best doctors to everyone - but the doctors have a limited number of patients they can meet.
    • “Common problem in the dating space where some users are consistently highly ranked. We can’t tell everyone to go date Tom Hanks because he doesn’t really have the time.”
    • How do we match people with those who are actually available to them, not just with whomever they’d like most? (see the capacity-aware sketch after this list)
    • How does this change the relevant metrics for evaluation?
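
One way to think about the capacity problem is to assign greedily while respecting each doctor's limited slots. A minimal sketch under that assumption (the scores, capacities, and the greedy strategy are illustrative, not the course's method):

```python
from collections import defaultdict

def capacity_aware_match(scores, capacity):
    """Greedy capacity-aware matching: walk through all (patient, doctor)
    pairs from highest to lowest predicted score, and accept a pair only
    while the doctor still has open slots and the patient is unmatched."""
    assigned_to = {}                      # patient -> doctor
    load = defaultdict(int)               # doctor -> patients assigned so far
    for patient, doctor, score in sorted(scores, key=lambda p: p[2], reverse=True):
        if patient in assigned_to or load[doctor] >= capacity[doctor]:
            continue
        assigned_to[patient] = doctor
        load[doctor] += 1
    return assigned_to

# Hypothetical predicted relevance scores and per-doctor capacities.
scores = [
    ("alice", "dr_best", 0.95), ("bob", "dr_best", 0.90),
    ("carol", "dr_best", 0.88), ("bob", "dr_ok", 0.70),
    ("carol", "dr_ok", 0.65),
]
capacity = {"dr_best": 1, "dr_ok": 2}
print(capacity_aware_match(scores, capacity))
# Only one patient gets dr_best; the rest are routed to doctors with open slots.
```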

Example: Music Recommendation

  • Not the same problem - everyone can listen to the same song
  • With music, people often want to listen to things they’ve listened to before, which is different from many other domains.
    • At the same time, you probably don’t want one song on repeat.
  • What is the relative cost of a bad recommendation? A bad song recommendation matters much less than a bad doctor recommendation.
  • Finding the right metric to answer the question you have starts with a deep understanding of the domain.

Evaluating Recommender Systems - Week 3

Online Evaluation and User Studies

  • Basic quantitative metrics might not match the user experience or business objectives that well
  • They can thus be complemented with other user-centric metrics
  • Measuring online with real users also lets you test how users experience recommendations of new items they haven’t rated historically (see the sketch below)
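
A minimal sketch of what such an online comparison might look like, assuming users are split into arms by a stable hash and compared on click-through rate (the hashing scheme and the metric choice are illustrative assumptions):

```python
import hashlib

def assign_arm(user_id, arms=("control", "new_recommender")):
    """Stable, deterministic assignment: hash the user id and map it
    onto one of the experiment arms."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]

def click_through_rate(events):
    """events: list of (user_id, clicked) tuples from the recommender log."""
    clicks = sum(1 for _, clicked in events if clicked)
    return clicks / len(events) if events else 0.0

# Hypothetical log: which users saw recommendations and whether they clicked.
log = [("u1", True), ("u2", False), ("u3", True), ("u4", False), ("u5", True)]
by_arm = {"control": [], "new_recommender": []}
for user_id, clicked in log:
    by_arm[assign_arm(user_id)].append((user_id, clicked))

for arm, events in by_arm.items():
    print(arm, round(click_through_rate(events), 3), f"({len(events)} users)")
```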

Evaluating Recommender Systems - Week 2

Serendipity

Evaluation approaches that hold out some data share a common flaw: they treat the recommender system as if it were just a machine-learning problem.

We don’t want to recover what the person already knows but hasn’t yet told us - we want to recommend things the user didn’t already know about. Looking at historical data about things the user has already shown interest in is therefore inherently limiting, and it biases us away from helping the user actually discover something serendipitous.

Likewise, in the commerce domain: rather than recommending things the customer would have bought anyway, what we really want is to drive uplift by helping them discover new things they wouldn’t otherwise have bought.

If we evaluate our system on how well it recreates a held-out dataset from the past, it will never get credit for recommending something better than what the user actually found on their own.
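
To make the flaw concrete, here is a minimal sketch of that hold-out style of evaluation: hide part of each user's known history, then score the recommender on how many hidden items it recovers. A recommendation the user would have loved but never found on their own cannot improve this score. The data and the hit-rate metric below are hypothetical illustrations:

```python
import random

def holdout_split(user_items, holdout_fraction=0.2, seed=0):
    """Hide a fraction of each user's known items; the rest is 'training' data."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, items in user_items.items():
        items = list(items)
        rng.shuffle(items)
        cut = max(1, int(len(items) * holdout_fraction))
        hidden[user] = set(items[:cut])
        train[user] = set(items[cut:])
    return train, hidden

def hit_rate(recommendations, hidden):
    """Fraction of users for whom at least one recommended item was hidden.
    Recommending something genuinely new (not in the user's history at all)
    can never increase this number - which is exactly the flaw in the text."""
    hits = sum(1 for user, recs in recommendations.items()
               if set(recs) & hidden.get(user, set()))
    return hits / len(recommendations) if recommendations else 0.0

# Hypothetical interaction histories and recommendations.
history = {"u1": ["a", "b", "c", "d", "e"], "u2": ["b", "c", "f", "g"]}
train, hidden = holdout_split(history)
recs = {"u1": ["d", "z"], "u2": ["x", "y"]}   # 'z', 'x', 'y' are novel items
print(hit_rate(recs, hidden))
```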


Evaluating Recommender Systems - Week 1

  • Metrics
    • Accuracy, Decision support, Rank, others (see the sketch after this list)
  • Evaluating without users
    • Evaluating offline data
    • Framework for hidden-data evaluation
  • Evaluation with users
    • Lab and Field Experiments (A/B Trials)
    • User surveys, log analysis
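
A minimal sketch of the three metric families from the list above, using hypothetical data: RMSE for accuracy, precision@k for decision support, and nDCG for rank.

```python
import math

def rmse(predicted, actual):
    """Accuracy: root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_at_k(ranked_items, relevant, k):
    """Decision support: share of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k

def ndcg_at_k(ranked_items, relevant, k):
    """Rank: discounted cumulative gain of the top-k list, normalised by the
    best possible ordering (binary relevance for simplicity)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical predictions and a ranked list for one user.
print(rmse([4.1, 3.0, 5.0], [4.0, 2.5, 4.5]))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))
print(ndcg_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))
```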