SUMMARY

In this project, we plan to implement the parallel alternating least squares (ALS) algorithm for large-scale recommendation, following the description of the algorithm given in this paper.

BACKGROUND

A recommendation system tries to predict the rating a user would give to an item. The method in the paper models this prediction as the inner product of embedding vectors for the user and the item. In particular, we represent each user $i$ with a dense vector $u_i$ and each item $j$ with a dense vector $v_j$, and the model predicts the rating that user $i$ gives to item $j$ as $\hat{r}_{i,j} = u_i^T v_j$. Suppose we have a rating matrix with $N$ users and $M$ items, and let $S$ be the set of pairs $(i,j)$ for which a rating $r_{i,j}$ of item $j$ by user $i$ has been observed. The objective of the algorithm is then to minimize $$\sum_{i=1}^N \sum_{j=1}^M I((i,j) \in S)\,(r_{i,j} - u_i^T v_j)^2$$ where $I$ is the indicator function. The parameters to learn are all the $u_i$ and $v_j$, chosen to minimize the difference between the predictions and the observed ratings. Note that even though $N$ and $M$ can be large, $u$ and $v$ live in a low-dimensional space, which keeps the computational cost manageable. Alternating least squares (ALS) solves this problem through a sequence of least-squares subproblems: if we fix all the $u_i$, the objective is an ordinary least-squares problem in the $v_j$, and vice versa. Each alternating step can only decrease the objective, so the algorithm converges to an estimate of the parameters.
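To make the decomposition explicit (the notation $V_{S_i}$ and $R_{S_i}$ below is ours, introduced for convenience): when $V$ is fixed, the terms involving different users do not interact, so the subproblem for user $i$ is $$\min_{u_i} \sum_{j:(i,j)\in S} (r_{i,j} - u_i^T v_j)^2 \;=\; \min_{u_i} \big\| R_{S_i} - V_{S_i} u_i \big\|_2^2,$$ where $V_{S_i}$ stacks the vectors $v_j$ with $(i,j) \in S$ as rows and $R_{S_i}$ stacks the corresponding observed ratings. This is an ordinary least-squares problem in $u_i$ alone, so all $N$ user subproblems can be solved independently, and hence in parallel.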

To be more specific, in every ALS iteration we first update $U = \{u_1, \dots, u_N\}$ while fixing $V = \{v_1, \dots, v_M\}$ to minimize the loss above, and the closed-form solution of this subproblem is $$ u_i = (V_{S_i}^T V_{S_i})^{-1} V_{S_i}^T R_{S_i} $$ for $i = 1, \dots, N$. The update rule for $V$ with $U$ fixed is symmetric.
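As a starting point for the implementation, here is a minimal single-threaded C++ sketch of one such user update, assuming user $i$'s observed ratings are available as (item, value) pairs and the factor dimension $k$ is small. The names `Rating`, `solveSmall`, and `updateUser` are ours, and the small ridge term is a numerical-stability assumption rather than part of the objective above.

```cpp
// Minimal sketch of one ALS user update via the normal equations.
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Rating { int item; double value; };

// Solve the small k x k system A x = b by Gaussian elimination with
// partial pivoting; A and b are taken by value and overwritten.
std::vector<double> solveSmall(std::vector<std::vector<double>> A,
                               std::vector<double> b) {
    const std::size_t k = b.size();
    for (std::size_t col = 0; col < k; ++col) {
        std::size_t piv = col;                       // pick the largest pivot
        for (std::size_t r = col + 1; r < k; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[piv][col])) piv = r;
        std::swap(A[col], A[piv]);
        std::swap(b[col], b[piv]);
        for (std::size_t r = col + 1; r < k; ++r) {  // eliminate below pivot
            double f = A[r][col] / A[col][col];
            for (std::size_t c = col; c < k; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }
    std::vector<double> x(k, 0.0);                   // back substitution
    for (std::size_t row = k; row-- > 0;) {
        double s = b[row];
        for (std::size_t c = row + 1; c < k; ++c) s -= A[row][c] * x[c];
        x[row] = s / A[row][row];
    }
    return x;
}

// u_i = (V_{S_i}^T V_{S_i})^{-1} V_{S_i}^T R_{S_i}, where V_{S_i} stacks the
// vectors of the items user i has rated and R_{S_i} the observed ratings.
std::vector<double> updateUser(const std::vector<Rating>& ratingsOfUser,
                               const std::vector<std::vector<double>>& V,
                               std::size_t k) {
    std::vector<std::vector<double>> A(k, std::vector<double>(k, 0.0));
    std::vector<double> b(k, 0.0);
    for (const Rating& r : ratingsOfUser) {
        const std::vector<double>& vj = V[r.item];
        for (std::size_t a = 0; a < k; ++a) {
            b[a] += vj[a] * r.value;
            for (std::size_t c = 0; c < k; ++c) A[a][c] += vj[a] * vj[c];
        }
    }
    // Small ridge term (our assumption, not in the objective above) so A is
    // invertible even when the user has fewer than k observed ratings.
    for (std::size_t a = 0; a < k; ++a) A[a][a] += 1e-8;
    return solveSmall(A, b);
}
```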

CHALLENGE

  • In the update stage, a large number of small linear systems need to be solved, so the matrix assembly and solve operations have to be efficient. In particular, we plan to use the conjugate gradient method, and since each machine owns many independent systems, we should be able to solve them in parallel with OpenMP to exploit a second level of multi-core parallelism (see the sketch after this list).
  • We need to send messages between the nodes of the cluster and synchronize them using MPI. Besides, we need to access the rating matrix from both the row view (for user updates) and the column view (for item updates), so we must reduce both the communication cost and the cache-miss cost.
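The sketch below illustrates the single-machine level of parallelism mentioned above. It assumes each local user's ratings are stored as (item, rating) pairs and that the full item-factor matrix is available; each user's normal-equation system is solved by conjugate gradient instead of the direct solve in the previous sketch, and the independent solves are spread across cores with an OpenMP parallel for. All names and types here are illustrative.

```cpp
// Per-user normal equations A_i u_i = b_i (A_i = V_i^T V_i, b_i = V_i^T R_i)
// solved with conjugate gradient, one independent system per OpenMP task.
#include <omp.h>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;                             // dense, row-major
using UserRatings = std::vector<std::pair<int, double>>;  // (item, rating)

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient for a small symmetric positive definite system A x = b.
static Vec conjugateGradient(const Mat& A, const Vec& b,
                             int maxIter = 50, double tol = 1e-10) {
    const std::size_t k = b.size();
    Vec x(k, 0.0), r = b, p = b, Ap(k);
    double rr = dot(r, r);
    for (int it = 0; it < maxIter && rr > tol; ++it) {
        for (std::size_t i = 0; i < k; ++i) Ap[i] = dot(A[i], p);
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < k; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rrNew = dot(r, r);
        for (std::size_t i = 0; i < k; ++i) p[i] = r[i] + (rrNew / rr) * p[i];
        rr = rrNew;
    }
    return x;
}

// Update every user vector owned by this machine. Each loop iteration is an
// independent least-squares solve, so OpenMP can run them on separate cores.
void updateLocalUsers(std::vector<Vec>& U, const std::vector<UserRatings>& ratings,
                      const Mat& V, std::size_t k) {
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < U.size(); ++i) {
        Mat A(k, Vec(k, 0.0));
        Vec b(k, 0.0);
        for (const auto& [item, value] : ratings[i]) {  // assemble A_i and b_i
            const Vec& vj = V[item];
            for (std::size_t a = 0; a < k; ++a) {
                b[a] += vj[a] * value;
                for (std::size_t c = 0; c < k; ++c) A[a][c] += vj[a] * vj[c];
            }
        }
        for (std::size_t a = 0; a < k; ++a) A[a][a] += 1e-8;  // ridge: keeps A SPD
        U[i] = conjugateGradient(A, b);
    }
}
```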

RESOURCES

GOALS

For this project we need to implement a high-performance training algorithm. We divide the project into several stages of optimization:

  • Distributed training algorithm
  • Optimize with CPUs
  • [Extra] Scale up the algorithm

For each stage we expect a substantial improvement. In the end, we will report the optimizations performed at each stage and the speedup each one achieves.

PLATFORM CHOICE

  • The language used to implement the algorithm will be C++, and we will use MPI and OpenMP as our parallel programming libraries. MPI is well suited because we need to pass the updated user vectors to other machines before updating the item vectors, and vice versa (a minimal sketch of this exchange follows this list). OpenMP can be used to solve many independent linear systems in parallel on a single machine with multiple cores.
  • The machines used in this project will be a standard cluster with MPI and OpenMP installed.
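Here is a minimal sketch of the MPI exchange mentioned above, assuming each rank owns a contiguous, equally sized block of user vectors stored row-major in one flat array; an actual implementation would likely use `MPI_Allgatherv` to handle uneven blocks.

```cpp
// After each rank recomputes its own block of user vectors, gather all blocks
// so every rank holds the complete, up-to-date U before the item update.
#include <mpi.h>
#include <vector>

void exchangeUserFactors(std::vector<double>& U,  // numUsers * k, row-major
                         int numUsers, int k) {
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int usersPerRank = numUsers / nprocs;          // assumed to divide evenly
    int blockDoubles = usersPerRank * k;

    // This rank has just recomputed its own slice of U in place, so the send
    // buffer is already in the right position inside the receive buffer.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  U.data(), blockDoubles, MPI_DOUBLE, MPI_COMM_WORLD);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    const int numUsers = 8, k = 4;
    std::vector<double> U(numUsers * k, 0.0);
    // ... local ALS update of this rank's user block would happen here ...
    exchangeUserFactors(U, numUsers, k);
    MPI_Finalize();
    return 0;
}
```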

SCHEDULE

Time Task
4.3-4.6 Design the MPI version of the algorithm
4.7-4.16 Implement the naive distributed version
4.16-4.30 Optimize the algorithm for solving the linear systems (conjugate gradient)
5.1-5.11 [Extra] Scale up to datasets that cannot fit into memory