Dimensionality reduction is a way to use efficient methodologies from machine learning to distill out the essential signals from high-dimensional vectors.
Matrix factorization can be done in multiple ways, from algebraic SVD to learned representations using methods like gradient descent or probabilistic methods.
Matrix Factorization and Dimensionality Reduction Recommenders
A ratings matrix can be seen as an overfit representation of user tastes and item descriptions. The conclusion to draw from this then is that by reducing the representations to a smaller dimensionality, we can reduce the overfitting while at the same time reduce the computational complexity.
Singular Value Decomposition
Information retrieval community was facing this earlier and used Singular Value Decomposition to reduce item dimensionality.
…where m is the number of users, n is the number items and k is the number of dimensions in the lower dimensionality representation. In this formula, the S matrix forms a diagonal list of weights that represents the importance of the new features.
U is a user-feature affinity matrix, meaning how important the features are to each user. Similarly, Q is an item-feature affinity matrix for the importance to each item.
By sorting the matrix by weight, we can analyse and determine a good cut-off point for how many dimensions we want to keep for our representations.
By multiplying back the new lower dimensionality matrixes we get the best approximation of the original matrix using only k dimensions, and where both users and items get represented in a common feature vector space.
As the reduced dimensions are orthogonal to each other to contain as much information as possible while being collectively optimal, they don’t fit back to real-world concepts/labels.
Prediction Rule
Different notation in this lecture;
…so p is the users preference value multiplied with the feature weight, times the item-feature affinity.
Often, a baseline predictor is subtracted before decomposing, so the rule becomes;
…where . (not exact notation for , because I can’t read the lecturers hand writing) i.e. the global mean plus the mean of the items normalised ratings “plus an item specific rating term”. Very poorly explained.
Dealing with missing values
- Impute values for missing value
- Subtract global mean everywhere, so missing values doesn’t remain 0/null
- Next lecture; ignore them
Adding new data - Folding In
For new that we want to calculate for without recalculating everything.
…similarly new items can be folded in. This relies on using the proper algebraic SVD, with the simplification techniques described below, the situation gets more tricky.
Conclusion
- Algebraically robust
- Downside; decomposing is slow
(Stochastic) Gradient Descent Techniques
Truncated SVD is the best rank-k approximation under global RMSE. What if we use gradient descent to find the best rank-k approximation instead of doing the full formal SVD computation?
- Benefits
- Fast
- Ignores missing data
Simplifying the error, instead of RMSE we can use SSE for faster computation.
Algorithm Structure
- Initialize values to train (item/user feature vectors) to arbitrary non-zero starting points
- Try to predict each rating in data set
- Use error and update rule to update values for next rating/sample
- Iterate until convergence
Simplified SVD
Going back to the standard formula, but including a baseline (user or item mean) as discussed above for scoring.
To simplify (since we really care about the user item representations and not the item weights), we include the sigma impact in P and Q getting;
Thus getting scoring rule;
FunkSVD - Pseudo algorithm
- Train features one at a time
- Train first feature until convergence
- Then train the next …
- Ignore missing values (mostly)
- Learn offset from personalized mean
Update step
…where is the learning rate and is a regularization term to bias against extreme models.
Caveat
The big caveat of using this simplification technique is that P and Q no longer are orthogonal, meaning folding is no longer guaranteed to work for new data.
In practice though, the features might be orthogonal enough that folding-in works with acceptable accuracy. This needs to be verified by model though and can not be assumed to be the case.
Algorithm Structure
initialize matrices
for f = 1 ... k:
until feature has converged:
for $$r_{ui} \in R:$$
predict $$r_{ui}$$
update $$p_{uf}$$, $$q_{if}$$
Alternative approaches
- Gradient descent over all features together
- Least squares (or alternating least squares)
- Expectation maximization
- Any other optimization method
Probabilistic Matrix Factorization
- Basic idea:
- Assume data is generated by random process (with known structure)
- Learn parameters that would generated data that looks like what you have
- Non-personalized probabilistic model
- P(i) is # of times i was bought, scaled to probability
- Divide by total # of purchases
Personalized probabilistic model - Probabilistic LSA
Goal: Compute and/or
Decompose with latent factors
: user picks a random feature
: picks a random movie from the feature
User preference decomposed into feature preference (latent features, like SVD).
Probabilistic LSA has the same form as SVD;
where and .
- These matrices are stochastic, not orthogonal
- That is, they encode probability distributions
- Learned with expectation maximization
PSLA for Ratings
- Basic idea: model ratings as a distribution (e.g. gaussian)
- Distribution parameters determined by item, user and/or latent feature
PMF
- Alternate formulation of probabilistic factorization of ratings matrix
- Models ratings as drawn from normal distributions
- Mean determined by user and item via features
- Not simple probability matrices like PLSI
- Gaussian mixture model
- Faster to train than PLSI
Assume ratings are normally distributed.
- User & item vectors define mean of distribution
- Variance models noise of ratings
- User and item feature values also random:
Comments