Matrix Factorization and Advanced Techniques - Week 1-2

Dimensionality reduction uses efficient machine learning methods to distill the essential signals out of high-dimensional vectors.

Matrix factorization can be done in multiple ways, from algebraic SVD to learned representations using methods like gradient descent or probabilistic methods.

Matrix Factorization and Dimensionality Reduction Recommenders

A ratings matrix can be seen as an overfit representation of user tastes and item descriptions. The conclusion to draw from this is that by reducing the representations to a smaller dimensionality, we can reduce the overfitting while at the same time reducing the computational complexity.

Singular Value Decomposition

The information retrieval community faced this problem earlier and used Singular Value Decomposition (SVD) to reduce item dimensionality.

$$R_{m \times n} \approx U_{m \times k} \, S_{k \times k} \, Q^T_{k \times n}$$

…where m is the number of users, n is the number of items and k is the number of dimensions in the lower-dimensionality representation. In this formula, the S matrix is a diagonal matrix of weights (the singular values) that represent the importance of the new features.

U is a user-feature affinity matrix, expressing how important each feature is to each user. Similarly, Q is an item-feature affinity matrix, expressing how important each feature is to each item.

By sorting the features by their weight in S (largest first), we can analyse and determine a good cut-off point for how many dimensions we want to keep in our representations.

By multiplying the new lower-dimensionality matrices back together, we get the best approximation of the original matrix using only k dimensions, with both users and items represented in a common feature vector space.

Because the reduced dimensions are constructed to be orthogonal to each other, so that collectively they contain as much information as possible, they generally don't map back to real-world concepts/labels.
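As a concrete illustration, here is a minimal NumPy sketch of keeping only the top k singular values (the toy matrix and variable names are my own, not from the lecture):

import numpy as np

# Tiny toy ratings matrix: rows are users, columns are items (hypothetical data)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])

U, s, Qt = np.linalg.svd(R, full_matrices=False)  # full decomposition R = U S Q^T
k = 2                                             # number of dimensions to keep
R_k = U[:, :k] @ np.diag(s[:k]) @ Qt[:k, :]       # best rank-k approximation of R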

Prediction Rule

The lecture uses slightly different notation here.

So the prediction p is the user's feature affinity multiplied by the feature weight, times the item-feature affinity, summed over the features.
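Written in the U, S, Q notation from above (the lecture's own symbols may differ), the prediction for user u and item i is:

$$p_{ui} = \sum_{f=1}^{k} u_{uf} \, s_f \, q_{if}$$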

Often, a baseline predictor is subtracted before decomposing, so the rule becomes;

…where the baseline is (not exact notation, because I can't read the lecturer's handwriting) the global mean, plus the mean of the item's normalised ratings, "plus an item-specific rating term". This was explained very poorly in the lecture.
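A common way to write this baseline rule (my reconstruction of the usual form, since the exact lecture notation was unreadable):

$$\hat{r}_{ui} = b_{ui} + \sum_{f=1}^{k} u_{uf} \, s_f \, q_{if}, \qquad b_{ui} = \mu + b_u + b_i$$

…where $$\mu$$ is the global mean rating, $$b_u$$ is the user's average offset from $$\mu$$, and $$b_i$$ is the item's average offset.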

Dealing with missing values

  • Impute values for the missing entries
  • Subtract the global mean everywhere, so missing values don't remain 0/null (see the small sketch after this list)
  • Next lecture: ignore them
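A minimal sketch of the mean-subtraction option (my own illustration; the toy matrix is made up):

import numpy as np

# Toy ratings matrix with NaN marking missing ratings (hypothetical data)
R = np.array([[4.0, np.nan, 3.0],
              [5.0, 2.0, np.nan],
              [np.nan, 1.0, 4.0]])

mu = np.nanmean(R)                  # global mean over the observed ratings only
R_centered = np.nan_to_num(R - mu)  # observed ratings become offsets from the mean;
                                    # missing entries become 0, i.e. "an average rating"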

Adding new data - Folding In

For new users that we want to compute predictions for without recalculating the whole decomposition.

The new user's ratings are projected through the existing item-feature matrix and weights; similarly, new items can be folded in. This relies on using the proper algebraic SVD; with the simplification techniques described below, the situation gets more tricky.
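The standard folding-in formula (the usual incremental-SVD form, written in the notation from above rather than the lecture's) projects the new user's baseline-subtracted ratings row vector through the item-feature matrix and the inverse weights:

$$u_{\text{new}} = r_{\text{new}} \, Q \, S^{-1}$$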

Conclusion

  • Algebraically robust
  • Downside: decomposing is slow

(Stochastic) Gradient Descent Techniques

Truncated SVD is the best rank-k approximation under global RMSE. What if we use gradient descent to find the best rank-k approximation instead of doing the full formal SVD computation?

  • Benefits
    • Fast
    • Ignores missing data

To simplify the error computation, we can use SSE instead of RMSE; minimizing one also minimizes the other.
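For reference, the standard definitions (notation is mine):

$$\text{SSE} = \sum_{(u,i)} (r_{ui} - \hat{r}_{ui})^2, \qquad \text{RMSE} = \sqrt{\text{SSE}/n}$$

Since RMSE is a monotone function of SSE for a fixed number of ratings n, minimizing SSE minimizes RMSE while skipping the division and square root.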

Algorithm Structure

  • Initialize values to train (item/user feature vectors) to arbitrary non-zero starting points
  • Try to predict each rating in the data set
  • Use the error and an update rule to adjust the values before the next rating/sample
  • Iterate until convergence

Simplified SVD

Going back to the standard formula, but including a baseline (user or item mean) as discussed above for scoring.

To simplify (since we really care about the user and item representations and not the feature weights themselves), we fold the impact of the S weights into the user and item matrices, which become P and Q.

Thus we get the scoring rule below.
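A common convention (the lecture may fold the weights in differently) is to split S between the two matrices, and then score from the rows of P and Q directly:

$$P = U\,S^{1/2}, \qquad Q \leftarrow Q\,S^{1/2}$$

$$\hat{r}_{ui} = b_{ui} + p_u \cdot q_i = b_{ui} + \sum_{f=1}^{k} p_{uf}\, q_{if}$$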

FunkSVD - Pseudo algorithm

  • Train features one at a time
  • Train first feature until convergence
  • Then train the next …
  • Ignore missing values (mostly)
  • Learn offset from personalized mean

Update step

The update nudges each parameter in the direction that reduces the prediction error, scaled by a learning rate, and includes a regularization term to bias against extreme models.
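A common way to write the FunkSVD update (the symbol choices λ for the learning rate and γ for the regularization weight are mine; the lecture's notation may differ):

$$\varepsilon_{ui} = r_{ui} - \hat{r}_{ui}$$

$$p_{uf} \leftarrow p_{uf} + \lambda\,(\varepsilon_{ui}\, q_{if} - \gamma\, p_{uf}), \qquad q_{if} \leftarrow q_{if} + \lambda\,(\varepsilon_{ui}\, p_{uf} - \gamma\, q_{if})$$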

Caveat

The big caveat of this simplification technique is that P and Q are no longer orthogonal, meaning folding-in is no longer guaranteed to work for new data.

In practice, though, the features might be orthogonal enough that folding-in works with acceptable accuracy. This needs to be verified per model, however, and cannot be assumed.

Algorithm Structure

initialize matrices
for f = 1 ... k:
  until feature has converged:
    for $$r_{ui} \in R$$:
      predict $$r_{ui}$$
      update $$p_{uf}$$, $$q_{if}$$
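A runnable Python sketch of this loop (my own illustration with made-up toy data and hyperparameters; the per-feature convergence test is simplified to a fixed number of passes):

import numpy as np

# Toy ratings as (user, item, rating) triples -- hypothetical data
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.0)]
n_users, n_items, k = 3, 3, 2
lrate, reg = 0.01, 0.02                  # learning rate and regularization (assumed values)

mu = np.mean([r for _, _, r in ratings]) # global-mean baseline
P = np.full((n_users, k), 0.1)           # user-feature matrix, arbitrary non-zero start
Q = np.full((n_items, k), 0.1)           # item-feature matrix, arbitrary non-zero start

for f in range(k):                       # train one feature at a time
    for _ in range(100):                 # stand-in for "until feature has converged"
        for u, i, r in ratings:          # only observed ratings are visited
            err = r - (mu + P[u] @ Q[i]) # prediction error for this sample
            puf, qif = P[u, f], Q[i, f]
            P[u, f] += lrate * (err * qif - reg * puf)
            Q[i, f] += lrate * (err * puf - reg * qif)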

Alternative approaches

  • Gradient descent over all features together
  • Least squares (or alternating least squares)
  • Expectation maximization
  • Any other optimization method

Probabilistic Matrix Factorization

  • Basic idea:
    • Assume data is generated by random process (with known structure)
    • Learn parameters that would generate data that looks like what you have
  • Non-personalized probabilistic model
    • P(i) is # of times i was bought, scaled to probability
    • Divide by the total # of purchases (see the formula after this list)
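Written out (standard empirical probability; the notation is mine):

$$P(i) = \frac{\text{count}(i)}{\sum_{j} \text{count}(j)}$$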

Personalized probabilistic model - Probabilistic LSA

Goal: Compute $$P(i|u)$$, the probability that user u picks item i, and/or the probability of a rating, $$P(r|u,i)$$.

Decompose this with latent features z:

$$P(i|u) = \sum_z P(z|u)\,P(i|z)$$

$$P(z|u)$$: the user picks a random feature

$$P(i|z)$$: a random movie is picked from that feature

User preference is thus decomposed into feature preferences (latent features, much like in SVD).

Probabilistic LSA has the same form as SVD: a user-feature matrix times a feature-item matrix,

where the user-feature entries are $$P(z|u)$$ and the feature-item entries are $$P(i|z)$$.

  • These matrices are stochastic, not orthogonal
    • That is, they encode probability distributions
  • Learned with expectation maximization

PLSA for Ratings

  • Basic idea: model ratings as a distribution (e.g. gaussian)
  • Distribution parameters determined by item, user and/or latent feature

PMF

  • Alternate formulation of probabilistic factorization of ratings matrix
  • Models ratings as drawn from normal distributions
    • Mean determined by user and item via features
    • Not simple probability matrices like PLSI
    • Gaussian mixture model
  • Faster to train than PLSI

Assume ratings are normally distributed.

  • User & item vectors define mean of distribution
  • Variance models noise of ratings
  • User and item feature values are themselves random, drawn from Gaussian priors (see below):
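The standard PMF formulation of these assumptions (my notation; the lecture's symbols may differ):

$$r_{ui} \mid p_u, q_i \sim \mathcal{N}(p_u \cdot q_i,\; \sigma^2)$$

$$p_u \sim \mathcal{N}(0,\; \sigma_P^2 I), \qquad q_i \sim \mathcal{N}(0,\; \sigma_Q^2 I)$$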

Comments