Evaluating Recommender Systems - Week 1

  • Metrics
    • Accuracy, Decision support, Rank, others
  • Evaluating without users
    • Evaluating offline data
    • Framework for hidden-data evaluation
  • Evaluation with users
    • Lab and Field Experiments (A/B Trials)
    • User surveys, log analysis

Basic Prediction and Recommendation Metrics

Early days

  • Accuracy and error measures
    • MAE, RMSE, MSE
  • Decision-support metrics
    • ROC AUC, Breese score, later precision/recall
  • Error meets decision-support/user experience
    • “Reversals”
  • User-centered metrics
    • Coverage, user retention, recommendation uptake, satisfaction

A commercial look

  • Accuracy is not important
    • Does it drive results? Recommending something that would’ve been bought anyway isn’t useful
  • Lift, cross-sales, up-sales, conversions
  • Led to thinking about measures anchored not only to user experience, but also to the recommender’s goals

Theme 1: Prediction vs. Top-N

  • Key distinction:
    • Prediction is mostly about accuracy, possibly decision support, and mainly about individual isolated predictions
    • Top-N is mostly about ranking, decision support (are the good things at the top?) and focuses comparatively on the complete list of recommendations

Theme 2: More than just metrics

  • Even simple evaluations are hard
    • MAE is easy to compute per item
      • But should it be averaged - across users or predictions?
      • How do we handle lack of data or coverage?
  • Comparative evaluation is even harder
    • Proper baseline
    • Different coverage, etc…
      • If one algorithm can produce recommendations for every item, and another only for items with a sufficient amount of data, how do you fairly compare the two?
  • It’s not just about the core metric calculation, it’s all the different choices made when applying it and how they affect the outcome

Theme 3: Unary data

  • Many metrics were designed for multi-point data (e.g. a ranking scale 1-5) and won’t work as well for unary data
  • Special metrics exist for unary data

Theme 4: Dead data vs. live recommendation

  • Retrospective (dead data) evaluation looks at how recommender systems would have predicted or recommended for items already consumed/rated
  • Prospective (live experiment) evaluation looks at how recommendations are actually received
  • These have fundamental differences (covered in a later lecture)

Goals of evaluation

  • To inform
    • Selection of recommender algorithms
    • Tuning of algorithms
    • Design of system as a whole
  • To enforce disciplined thinking
    • About recommendation
    • About user experience

Hidden Data Evaluation

  • Use existing data collected
  • Use data to simulate behaviour
  • Test whether recommender can predict behaviour
    • Can the recommender predict the rating a user will assign accurately?

Prerequisites

  • Data …

Evaluation Structure

Split the data into training and test sets; the training data is used to train the algorithm, which is then evaluated against the test set.
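
A minimal sketch of this setup, assuming ratings are stored as (user, item, rating) tuples and a hypothetical `recommender` object with `fit` and `predict` methods:

```python
import random

def holdout_split(ratings, test_fraction=0.2, seed=42):
    """Randomly hide a fraction of the ratings as test data."""
    rng = random.Random(seed)
    train, test = [], []
    for row in ratings:  # row = (user, item, rating)
        (test if rng.random() < test_fraction else train).append(row)
    return train, test

# Hypothetical usage:
# train, test = holdout_split(all_ratings)
# recommender.fit(train)
# errors = [abs(recommender.predict(u, i) - r) for (u, i, r) in test]
```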

What do we measure?

  • How close prediction is to actual rating?
  • Are purchased items in recommendation list?
  • Where are purchased items ranked in recommendation list?

Cross-validation

  • Ensure that the test set is not unrepresentative by chance by doing (for example) a K-fold split and cross-validating the predictions (sketched below).
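
A sketch of the same idea using scikit-learn's `KFold`; `train_recommender` is a hypothetical trainer and the rows are assumed to be numeric (user, item, rating) triples:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_mae(ratings, n_splits=5):
    """ratings: array-like of shape (n, 3) with numeric (user, item, rating) rows."""
    ratings = np.asarray(ratings, dtype=float)
    fold_maes = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(ratings):
        model = train_recommender(ratings[train_idx])  # hypothetical trainer
        preds = np.array([model.predict(u, i) for u, i, _ in ratings[test_idx]])
        fold_maes.append(np.mean(np.abs(preds - ratings[test_idx][:, 2])))
    return fold_maes
```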

Benefits

  • Efficient, can rerun for however many algorithms we want
  • No users required

Drawbacks

  • Does predictive accuracy matter?
  • If a user would have been shown a recommendation list - would they still have picked the items they actually bought?
    • I.e., thinking about the impact the recommender system starts having on future behaviour once it’s actually introduced.

Prediction Accuracy Metrics

Common metrics

MAE - Mean Absolute Error

The mean absolute deviation of the predictions from the actual ratings. Robust in the sense that it is less sensitive to outliers than squared-error measures.

MSE - Mean Squared Error

Squares the error instead of taking the absolute value. A consequence of this is that large errors get penalised more than small errors.

RMSE - Root Mean Squared Error

Similar to MSE, but more intuitive since it is on the same scale as the ratings (a standard deviation rather than a variance). Like MSE, it penalises large errors more heavily.
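
For reference, the standard definitions over $n$ test predictions $p_i$ with actual ratings $r_i$:

```latex
\mathrm{MAE}  = \frac{1}{n} \sum_{i=1}^{n} \lvert p_i - r_i \rvert
\qquad
\mathrm{MSE}  = \frac{1}{n} \sum_{i=1}^{n} (p_i - r_i)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```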

Summation

  • Usually, average over all ratings
  • Alternatively, average over user averages
    • If one user has 3000 ratings and another has 10, naive averaging gives the first user 300 times more influence on the overall metric. Averaging over users equalises their impact.
    • I.e., is the goal that the average prediction should be as good as possible or that the average user should get as good predictions as possible?
  • Advice - look at both and analyse the tradeoffs you’re making.
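
A small sketch of the two summation choices, assuming the per-prediction absolute errors are tagged with the user they belong to:

```python
from collections import defaultdict

def mae_over_predictions(errors):
    """errors: list of (user, abs_error) pairs; average over all predictions equally."""
    return sum(e for _, e in errors) / len(errors)

def mae_over_users(errors):
    """Compute each user's own MAE first, then average those per-user values."""
    per_user = defaultdict(list)
    for user, e in errors:
        per_user[user].append(e)
    return sum(sum(es) / len(es) for es in per_user.values()) / len(per_user)
```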

Comparing Different Algorithms

  • Some algorithms might not produce predictions for every item. How do we treat that? If coverage differs:
    • Compare only the common subset
    • Supplement algorithm with a default for full coverage
      • In the real world, we will likely have a fallback mechanism in place in order to always produce a result list. This can thus be used as a stand-in for the algorithm when the coverage is lower.

Irrelevant items dominating errors (!)

One further consideration: generally we want to be most precise at the high end of the rating scale, since those items are the ones that will be recommended.

For items with low scores, precision matters much less; they will not be recommendation candidates whether they are predicted as 1 or 3. A prediction of 3 vs. 5 matters greatly, though, as it can determine whether the item shows up as a top result or not.

This is likely one reason why a two-stage approach is common, where one system generates the initial list of candidates and a second system specialises in scoring and ranking those candidates. The first system just needs to be good enough to filter out clearly uninteresting candidates, leaving the finer scoring and ranking to the second.

Decision-Support Metrics

  • Help users make good decisions
  • Reiterates the point above: the top of the list is what matters most
    • Absolute predictions are less relevant; the relative ranking in the result list is what matters to the user

Errors and Reversals

  • How often is the system wrong?
    • What does it mean to be wrong? Highly subjective in many cases
    • Not commonly used anymore
  • Reversals are large mistakes, e.g. off by 3 points on a 5-point scale
    • The intuition is that a large mistake leads to reduced user satisfaction and loss of confidence
    • Again, reported as either total or user averages
    • Also not commonly used anymore

Precision and Recall

  • Information Retrieval Metrics
    • Precision
      • Percentage of selected items that are relevant
    • Recall
      • Percentage of relevant items that are selected
    • Balance the two with F-metrics, e.g. $F_1 = \frac{2PR}{P + R}$
  • Usually favoured in theory, but might not always be so feasible to calculate
    • Problem 1 - Need ground truth defined for all items
      • If we had that - why bother with a recommender system in the first place?
      • Ways this is commonly addressed
        • Creating a limited comparison set of rated items for which the metrics are calculated
          • This set risks being biased, though, since what has been purchased/rated might not be random
        • Human-rating experiments for randomly sampled subsets of items
    • Problem 2 - Covers entire data set - not targeted on top N
      • Addressed through Precision@n (P@n) and Recall@n (R@n)
        • Precision@n: the share of the top-n items that are good
        • Recall@n: the relevant items found in the top n over the total number of relevant items
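
A small sketch of P@n and R@n, assuming a ranked list of recommended item ids and a set of items judged relevant:

```python
def precision_recall_at_n(recommended, relevant, n):
    """recommended: ranked list of item ids; relevant: set of 'good' item ids."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall_at_n(['a', 'b', 'c', 'd'], {'a', 'd', 'x'}, n=3) -> (0.33..., 0.33...)
```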

Receiver Operating Characteristic (ROC)

A curve that shows the tradeoff between the true positive rate (hit rate) and the false positive rate as the relevance cutoff varies.

The area under the curve (AUC) is an overall measure of the recommender’s effectiveness.
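
A minimal example with scikit-learn's `roc_auc_score`, assuming binary relevance labels and the recommender's predicted scores for the same items (the numbers here are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

relevance = [1, 0, 1, 1, 0, 0]              # 1 = the user actually liked/consumed the item
scores    = [4.5, 2.1, 3.9, 3.0, 3.2, 1.5]  # the recommender's predicted scores

print(roc_auc_score(relevance, scores))  # area under the ROC curve; 1.0 = perfect ranking
```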

Final reflections

  • Generally, the different metrics tend to correlate well with each other
  • Precision@n and overall precision are probably the most commonly used
  • None of these metrics overcome the problem of being based on rated items only

Rank-Aware Top-N Metrics

  • As opposed to prediction accuracy or how good the system is at finding things, we now look at where in the list items appear, i.e. how well the system models relative preference.
  • Two families of metrics
    • Binary relevance
      • Good or not
        • “Is it good?”
    • Utility
      • Measurement of absolute or relative ‘goodness’ of items (e.g. ratings)
        • “How good is it?”

Mean reciprocal rank

  • “Where is the first relevant item?”
    • Needs binary relevance judgements

For each user u:

  • Generate list of recommendations
  • Find the rank $k_u$ of the first relevant recommendation (the first rec has rank 1)
  • Compute the reciprocal rank as $\mathrm{RR}_u = \frac{1}{k_u}$
    • The score drops much more steeply at low $k_u$, meaning positions further down the list count for much less.
      • 1/1, 1/2, 1/3, 1/4 … 1/100

Overall algorithm performance is mean reciprocal rank:

$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{k_u}$$
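
A small sketch in Python, assuming per-user ranked recommendation lists and sets of relevant items (users with no relevant item recommended contribute 0 here):

```python
def mean_reciprocal_rank(rec_lists, relevant_sets):
    """rec_lists, relevant_sets: dicts keyed by user id."""
    rr_values = []
    for user, recs in rec_lists.items():
        rr = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in relevant_sets.get(user, set()):
                rr = 1.0 / rank  # reciprocal rank of the first relevant item
                break
        rr_values.append(rr)
    return sum(rr_values) / len(rr_values)
```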

Benefits

  • Very simple
  • Clearly models targeted search or recommendation where a user searches for a specific thing and should find it as fast as possible.

Drawbacks

  • Less clearly appropriate for general recommendation scenarios without a binary relevance judgement.

Average precision

Recap:

Precision: what fraction of the n recs are ‘good’?

  • Requires a fixed n
  • Treats all errors equally
    • But the accuracy of the first few items is more important

Average precision:

For each user:

  • For each relevant item
    • “Compute precision of list through that item”
      • (Look at example below)
  • Average sub-list precisions

Result:

  • The relevance of the 1st item counts in every sub-list precision
  • The relevance of the 2nd counts in one fewer
  • etc…

Example

For a ranked list (positions 1-5): A+, B-, C+, D+, E-

…where + indicates a good item and - a bad one.

We iterate through the relevant items, compute the share of good items in the list up to and including each one, and then divide by the total number of good items. For the list above, that becomes:

$$\mathrm{AP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{4}\right) \approx 0.81$$

The result of this formula is that items at the top ranks are included in all the later sub-list precisions, increasing their impact on the total score.

  • Take the Mean Average Precision (MAP) over all users’ average precision (AP) values:

$$\mathrm{MAP} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}(u)$$
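
A small sketch of AP and MAP, assuming ranked recommendation lists and relevance sets as above:

```python
def average_precision(recommended, relevant):
    """Mean of the precision of the sub-list ending at each relevant item."""
    hits, precisions = 0, []
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rec_lists, relevant_sets):
    aps = [average_precision(recs, relevant_sets.get(u, set())) for u, recs in rec_lists.items()]
    return sum(aps) / len(aps)

# The worked example above:
print(average_precision(['A', 'B', 'C', 'D', 'E'], {'A', 'C', 'D'}))  # ≈ 0.81
```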

Rank Correlation

If we can rank items for a user

  • Absolute judgement (ratings)
  • Relative judgement (pairwise preferences)

Then we can compare the system’s order O to the user’s preference order.

This is done by computing a rank correlation, commonly the Spearman correlation, which is the Pearson correlation applied to ranks.

  • Assign item rank values
  • Ties get average of ranks
  • Based on these rank values, compute the Pearson correlation
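
A minimal example using `scipy.stats.spearmanr`, which handles the rank conversion and tie-averaging internally; the scores here are made up for illustration:

```python
from scipy.stats import spearmanr

system_scores = [4.8, 4.1, 3.9, 2.5, 1.0]  # the recommender's predicted scores
user_ratings  = [5,   3,   4,   2,   1]    # the user's own ratings for the same items

rho, _ = spearmanr(system_scores, user_ratings)
print(rho)  # 1.0 would mean the two orderings agree perfectly
```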

Problems with Spearman

  • Punishes all misplacements equally
  • Goal: weight things at the top of the list more heavily

Solution: Discounted Cumulative Gain

  • Measure utility of item at each position in the list
    • For unary data, 1/0 ratings
  • Discount by position, so things at front are more important
  • Normalize by total achievable utility
  • Results in Normalized Discounted Cumulative Gain (nDCG)

$$\mathrm{DCG} = \sum_{i} \frac{r_i}{\max(1, \log_b i)}$$

…where $i$ is the position in the list, $r_i$ is the utility (e.g. the rating) of the item at that position, and $b$ is the base of the logarithm.

Other possible discounts include various decay functions, such as exponential decay.

  • This matches the intuition that users become exponentially less likely to look at items lower down the list.

Example

For the list A4, B3, C0, D5 (where the number is the user’s rating) and using log base 2, DCG becomes:

$$\mathrm{DCG} = \frac{4}{1} + \frac{3}{1} + \frac{0}{\log_2 3} + \frac{5}{\log_2 4} = 4 + 3 + 0 + 2.5 = 9.5$$

Worth noting here is that DCG measures how well the items already in the list are ordered.

Because the discount is $\max(1, \log_b i)$, items at positions 1…b are not discounted; the max also avoids division by zero at position 1, where $\log(1) = 0$.
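
A small sketch of DCG/nDCG with the max(1, log_b i) discount described above, assuming base-2 logarithms and a list of the user's ratings in recommended order:

```python
import math

def dcg(ratings, base=2):
    """Positions 1..base are undiscounted because max(1, log_b i) = 1 there."""
    return sum(r / max(1.0, math.log(i, base)) for i, r in enumerate(ratings, start=1))

def ndcg(ratings, base=2):
    """Normalise by the DCG of the same ratings in the ideal (descending) order."""
    ideal = dcg(sorted(ratings, reverse=True), base)
    return dcg(ratings, base) / ideal if ideal > 0 else 0.0

print(dcg([4, 3, 0, 5]))   # 9.5, matching the worked example above
print(ndcg([4, 3, 0, 5]))  # compared against the ideal order 5, 4, 3, 0
```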

Fraction of Concordant Pairs

What fraction of pairs are in the correct relative order?
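
A small sketch, assuming dicts mapping the same items to predicted and actual scores (pairs tied in the actual scores are skipped):

```python
from itertools import combinations

def fraction_concordant_pairs(predicted, actual):
    """Share of non-tied item pairs whose predicted order matches the actual order."""
    concordant, total = 0, 0
    for a, b in combinations(actual, 2):
        if actual[a] == actual[b]:
            continue  # no ground-truth order for tied pairs
        total += 1
        if (predicted[a] - predicted[b]) * (actual[a] - actual[b]) > 0:
            concordant += 1
    return concordant / total if total else 0.0
```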

Rank Effectiveness

If we have a user preference order, we can measure rank effectiveness.

  • Ask recommender to order user’s rated items, not pick them from the haystack
  • Compare recommender order to user order
  • Avoids certain problems with missing data

Conclusion

nDCG and MAP are increasingly common. MRR is also used, particularly in information retrieval.
