Content-Based Recommenders
Assumes stable preferences that can be measured through content attributes.
Basic idea: model people and items based on
- TF-IDF
- Cosine similarity
- Data Normalisation
TF-IDF for content-based filtering
TF-IDF can be used to create a weighted vector of attributes as a profile for an item. These item profiles can be combined with user signals such as ratings to create a user profile vector, which can then be matched against future items.
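A minimal sketch of the idea above, assuming items are described by tag lists and the user signal is a 1-5 rating. The function names, the toy items, and the choice to mean-centre ratings before weighting are illustrative assumptions, not a prescribed method.

```python
import math
from collections import Counter, defaultdict

def tfidf_profiles(item_tags):
    """Return {item: {tag: weight}} using raw TF times log(N / df)."""
    n_items = len(item_tags)
    # Document frequency: in how many items each tag appears.
    df = Counter(tag for tags in item_tags.values() for tag in set(tags))
    profiles = {}
    for item, tags in item_tags.items():
        tf = Counter(tags)
        profiles[item] = {t: tf[t] * math.log(n_items / df[t]) for t in tf}
    return profiles

def user_profile(profiles, ratings):
    """Sum item profiles, weighted by mean-centred ratings (an assumption)."""
    mean = sum(ratings.values()) / len(ratings)
    profile = defaultdict(float)
    for item, rating in ratings.items():
        for tag, weight in profiles[item].items():
            profile[tag] += (rating - mean) * weight
    return dict(profile)

# Hypothetical toy data.
items = {
    "alien": ["scifi", "horror", "space"],
    "solaris": ["scifi", "space", "drama"],
    "amelie": ["romance", "comedy", "drama"],
}
profiles = tfidf_profiles(items)
user = user_profile(profiles, {"alien": 5, "amelie": 1})
```

Tags from the highly rated item end up with positive weights and tags from the poorly rated item with negative ones, so the user vector can be compared directly against the profile of an unseen item such as "solaris".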
Common tweaks
TF Variations
- Interpretation of term frequency: can be binary (term present or absent), or a boolean based on whether the count passes a certain threshold
- Logarithmic frequencies log(tf + 1)
- Normalized frequency (divide by document length)
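The three variations listed above can be sketched as small functions. The function names and the default threshold of 1 are assumptions made for illustration.

```python
import math

def tf_binary(count, threshold=1):
    """1.0 if the term occurs at least `threshold` times, else 0.0."""
    return 1.0 if count >= threshold else 0.0

def tf_log(count):
    """Dampened frequency: log(tf + 1), so repeats add less and less."""
    return math.log(count + 1)

def tf_normalized(count, doc_length):
    """Frequency relative to document length (0.0 for an empty document)."""
    return count / doc_length if doc_length else 0.0
```

Any of these can replace the raw count in the TF part of TF-IDF; which one works best depends on the data.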
BM25
- Takes into account term frequency in the query and in the document, the number of documents containing the term, and document length.
- Various versions (BM11, BM25, etc.) that weight these factors differently
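A rough sketch of BM25 scoring over tokenised documents, to make the ingredients above concrete. The parameter defaults (k1 = 1.5, b = 0.75) are commonly quoted values rather than anything fixed by these notes, the IDF form is the smoothed variant that stays non-negative, and the query-frequency factor of the full Okapi formula is left out for brevity. Setting b = 1 gives the BM11 length normalisation and b = 0 gives BM15.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalised, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus.
docs = [
    ["space", "horror", "alien", "space"],
    ["romance", "comedy", "paris"],
    ["space", "drama", "ocean", "memory", "grief", "station"],
]
scores = bm25_scores(["space", "horror"], docs)
```

The first document matches both query terms and is short, so it scores highest; the second matches nothing and scores zero.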
Additional challenges
- Phrases and n-grams
- Significance in documents (headers etc.)
- In the recommender case with user-supplied tags, the authority of the tagging user might also matter here.
- General document authority (e.g. PageRank)
- Two items might match the search equally well, but have different popularity/quality scores that would affect the ranking
- Implied content (links, usage)
- Generally, the lack of semantic understanding of context in traditional methods means they are only really effective for literal searches.
- Can we understand an item from the contexts in which it occurs, and from where it is linked?
Building profiles
Cosine similarity gives a score between -1 and 1 that can be interpreted as expressing dislike (near -1), indifference (near 0), or like (near 1).
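A small sketch of matching a user profile against candidate items with this score, assuming sparse vectors stored as dicts. The profile weights and item names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as "no opinion"
    return dot / (norm_a * norm_b)

# Hypothetical user profile with a negative weight for a disliked attribute.
user = {"scifi": 1.0, "horror": 0.5, "romance": -1.0}
candidates = {
    "space_thriller": {"scifi": 1.0, "horror": 1.0},
    "rom_com": {"romance": 1.0, "comedy": 1.0},
    "cooking_show": {"food": 1.0},
}
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(user, candidates[name]),
    reverse=True,
)
```

Here the item sharing liked attributes scores positive, the unrelated item scores zero, and the item built on a disliked attribute scores negative. Negative scores only appear when the profile carries negative weights, for example after centring ratings.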
A limitation of simple content-based profiles is that, with a naive implementation, interaction between different dimensions isn’t considered. Liking chocolate sauce on ice cream doesn’t mean you like chocolate sauce on pizza.
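The problem can be seen in a tiny example. With a purely additive profile, both combinations score the same. Adding crossed pair features is one possible workaround, shown here only as an illustration; the feature names, weights, and the `with_pairs` helper are hypothetical.

```python
from itertools import combinations

def with_pairs(features):
    """Return the features plus one crossed feature per pair, e.g. 'a&b'."""
    feats = set(features)
    feats |= {"&".join(pair) for pair in combinations(sorted(features), 2)}
    return feats

def score(profile, features):
    """Additive score: sum of profile weights for the features present."""
    return sum(profile.get(f, 0.0) for f in features)

profile = {"chocolate_sauce": 1.0, "ice_cream": 1.0, "pizza": 1.0}
sundae = ["chocolate_sauce", "ice_cream"]
odd_pizza = ["chocolate_sauce", "pizza"]

# Naive additive profile: the two dishes are indistinguishable.
naive_sundae = score(profile, sundae)
naive_pizza = score(profile, odd_pizza)

# With a crossed feature, the profile can dislike one specific combination.
crossed_profile = {**profile, "chocolate_sauce&pizza": -3.0}
crossed_sundae = score(crossed_profile, with_pairs(sundae))
crossed_pizza = score(crossed_profile, with_pairs(odd_pizza))
```

The cost is that the number of pair features grows quadratically with the number of attributes, so this only works for small attribute sets.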
Strategies that embed semantic meaning in the feature vectors can improve the results, either by explicit means (for example, links between concepts) or implicit means (e.g. Word2vec-type contextual embeddings).