Evaluating Recommender Systems - Week 3

Online Evaluation and User Studies

  • Basic quantitative metrics might not match the user experience or business objectives that well
  • Thus can be complemented with other user metrics
  • Measuring online with real users also lets you test how users experience recommendations of new items that they haven’t rated historically
  • Usage Logs
    • Great for evaluating actual use of systems
    • Potential for measuring retrospective accuracy
  • Polls, Surveys, Focus Groups
    • Great for a more qualitative understanding
  • Lab and Online Lab Experiments
    • Careful controls, but decontextualised from the real-life situation
  • Field Experiments and A/B Tests
    • Great for comparing variants and measuring results
      • Immediate behaviour
      • Longer-term behaviour
      • Subjective evaluations
    • Massive A/B testing is great for optimising, but purely mechanical optimisation risks never uncovering the underlying principles

Usage Logs and Analysis

  • By keeping logs, you can analyse how a user’s future actions or ratings relate to their recommendations (see the sketch after this list)
    • Are people using recommendations?
    • Are the recommendations matching the user’s future ratings?
    • How does the behaviour of different groups change over time, depending on whether they use recommendations or not?
  • Setting up proper logging gives you great data, but if it’s set up improperly then the data might end up useless
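
As a small illustration of this kind of log analysis, here is a minimal Python sketch (with made-up log records) that measures how often shown recommendations are acted upon afterwards; the same log structure also supports retrospective accuracy checks against the user’s later ratings:

```python
# Hypothetical log records: (user id, items recommended to them, items they used afterwards)
log = [
    ("u1", {"a", "b", "c"}, {"b", "x"}),
    ("u2", {"d", "e"},      {"y"}),
    ("u3", {"a", "f"},      {"a", "f", "z"}),
]

hits, shown = 0, 0
per_user_hit_rate = {}
for user, recs, later_actions in log:
    overlap = recs & later_actions          # recommendations the user actually acted on
    hits += len(overlap)
    shown += len(recs)
    per_user_hit_rate[user] = len(overlap) / len(recs)

print(f"overall hit rate: {hits / shown:.2f}")
print(per_user_hit_rate)  # e.g. compare groups that do or don't follow recommendations over time
```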

A/B Studies (Field Experiments)

  • The goal is to see whether a system change leads to a positive improvement in user activity
  • Focus on a small number of KPIs that you want to move
  • *“A good metric is going to be a credible near-time proxy for long-term customer value”*
  • Measure as much as you can, not only the KPIs
    • In order to detect unforeseen consequences and diagnose why an experiment did or didn’t work
  • Suggested to run the experiment for at least 1-2 weeks
    • Especially important if there’s a notable visual change to users - you can’t change things too often
    • Also in order to ensure that you capture the behaviour after a possible novelty effect wears off
      • If a new algorithm generates a completely new set of recommendations, that novelty in itself might lead to an initial uplift
    • Multiple full weeks to capture the weekly cycle (sometimes even longer cycles, like a month, might be relevant)
  • Analyse results only after a full experiment cycle (see the sketch below)
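
As a rough sketch of what that post-experiment analysis could look like, the following Python snippet compares a conversion-style KPI between the control and treatment groups with a simple two-proportion z-test (all counts are made up, and in practice you would look at far more metrics than a single KPI):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # via the standard normal CDF
    return p_a, p_b, z, p_value

# Hypothetical counts after two full weeks (complete weekly cycles in both groups)
p_a, p_b, z, p = two_proportion_ztest(conv_a=1180, n_a=25_000, conv_b=1295, n_b=25_000)
print(f"control={p_a:.3%}  treatment={p_b:.3%}  z={z:.2f}  p={p:.4f}")
```

The two-proportion test is just one option; the same setup works with any KPI for which you can summarise each group and quantify the uncertainty of the difference.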

User-Centered Evaluation (Interview with Bart Knijnenburg)

User-centered evaluation shows that the way recommendations are presented to users, and other contextual factors, often have much more impact than the algorithm itself. It is therefore important to keep a holistic perspective when implementing recommendations.

You can’t really say that your recommendations give users a pleasant experience unless you actually measure precisely that.

User data is very noisy - users do things for all kinds of reasons that have nothing to do with the recommendations. Thus, just looking at their actions might not tell you that much about how good your recommendations are.

For many business metrics, like Lifetime Value or retention over time, it takes a long time before you can follow up on the effect of a change. User satisfaction is oftentimes a quite good proxy that is easier to measure in near-time.

You might also discover non-obvious things, for example that a longer total viewing time might correlate negatively with user satisfaction.

By evaluating the subjective measures of the users, you might learn better proxy KPIs to optimise for.
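
As a small illustration of those last two points, this sketch (with made-up numbers) correlates a behavioural metric with survey-based satisfaction to check whether the behaviour is actually a good proxy KPI:

```python
import statistics

# Hypothetical paired observations per user: total viewing time (minutes)
# and self-reported satisfaction (1-5) from a questionnaire
viewing_time = [310, 120, 450, 90, 260, 380, 150, 500]
satisfaction = [2, 4, 2, 5, 3, 2, 4, 1]

r = statistics.correlation(viewing_time, satisfaction)  # Pearson r (Python 3.10+)
print(f"correlation between viewing time and satisfaction: {r:.2f}")
```

A negative value here would be exactly the kind of non-obvious finding mentioned above: the behavioural metric moves in the opposite direction from what users actually report.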

Formative Studies and Summative Studies

Formative studies are non-scientific user studies to improve user experience during development.

Summative studies are scientific experiments that compare versions of the system against each other to evaluate which one is statistically better, i.e., similar to A/B tests but including user questionnaires.
