What is rank in the ALS machine learning algorithm in Apache Spark MLlib?

Question

I wanted to try an example of the ALS machine learning algorithm. My code works fine, but I do not understand the rank parameter used in the algorithm.

I have the following code in Java:

    // Build the recommendation model using ALS
    int rank = 10;
    int numIterations = 10;
    // The last argument (0.01) is lambda, the regularization parameter
    MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings),
            rank, numIterations, 0.01);

I have read somewhere that it is the number of latent factors in the model.

Suppose I have a dataset of (user, product, rating) with 100 rows. What value should rank (the number of latent factors) have?

score 28 · Accepted Answer · answered Jun 09 '15 at 12:36


As you said, the rank refers to the presumed latent or hidden factors. For example, if you were measuring how much different people liked movies and tried to cross-predict them, you might have three fields: person, movie, number of stars. Now, let's say that you were omniscient and knew the absolute truth: that all the movie ratings could in fact be perfectly predicted by just 3 hidden factors: sex, age, and income. In that case the "rank" of your run should be 3.
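To make the idea concrete, here is a minimal plain-Java sketch (no Spark dependency) of what a rank-3 factorization looks like. It assumes the usual matrix-factorization picture in which each user and each product gets a vector of length `rank` and a predicted rating is their dot product. The class name, the factor values, and the user/product counts are all hypothetical, chosen only for illustration.

```java
// Illustrative sketch only: not Spark code, and the numbers are made up.
final class LatentFactorSketch {

    // Predicted rating = dot product of the user's and the product's
    // factor vectors; both vectors must have length `rank`.
    static double predict(double[] userFactors, double[] productFactors) {
        if (userFactors.length != productFactors.length) {
            throw new IllegalArgumentException("factor vectors must share the same rank");
        }
        double sum = 0.0;
        for (int i = 0; i < userFactors.length; i++) {
            sum += userFactors[i] * productFactors[i];
        }
        return sum;
    }

    // Number of values the model has to fit: one vector of length `rank`
    // per user plus one per product.
    static long parameterCount(int numUsers, int numProducts, int rank) {
        return (long) (numUsers + numProducts) * rank;
    }

    public static void main(String[] args) {
        // Hypothetical rank-3 factor vectors for one user and one product.
        double[] user = {0.5, 1.0, 2.0};
        double[] product = {2.0, 1.0, 0.5};
        System.out.println("predicted rating: " + predict(user, product));

        // Hypothetical dataset: 20 users, 15 products, rank 10.
        System.out.println("parameters to fit: " + parameterCount(20, 15, 10));
    }
}
```

With the hypothetical 20 users and 15 products, a rank of 10 already means 350 fitted values, which is a lot to estimate from a 100-row dataset, so a small dataset usually calls for a small rank.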

Of course, you don't know how many underlying factors, if any, drive your data, so you have to guess. The more you use, the better the results up to a point, but the more memory and computation time you will need.

One way to approach it is to start with a rank of 5-10, then increase it, say 5 at a time, until your results stop improving. That way you determine the best rank for your dataset by experimentation.
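The search described above could be sketched roughly as follows. This is plain Java with no Spark calls: `validationError` is a hypothetical callback that, in a real run, would train ALS with the given rank and return an error measure (for example RMSE) on held-out ratings. The class name, the method name, and the `minImprovement` threshold are assumptions made for this example.

```java
import java.util.function.IntToDoubleFunction;

// Illustrative sketch only: the error function is supplied by the caller.
final class RankSearch {

    // Start at startRank, increase by `step`, and stop as soon as the
    // validation error improves by less than minImprovement (or maxRank
    // is exceeded). Returns the last rank that gave a real improvement.
    static int chooseRank(IntToDoubleFunction validationError,
                          int startRank, int step, int maxRank,
                          double minImprovement) {
        int bestRank = startRank;
        double bestError = validationError.applyAsDouble(startRank);
        for (int rank = startRank + step; rank <= maxRank; rank += step) {
            double error = validationError.applyAsDouble(rank);
            if (bestError - error < minImprovement) {
                break; // results stopped improving
            }
            bestRank = rank;
            bestError = error;
        }
        return bestRank;
    }

    public static void main(String[] args) {
        // Made-up error curve standing in for "train ALS, measure RMSE".
        IntToDoubleFunction fakeError = rank -> Math.abs(rank - 20) + 1.0;
        System.out.println("chosen rank: " + chooseRank(fakeError, 5, 5, 50, 0.01));
    }
}
```

Measuring the error on ratings that were held out of training matters here: error on the training data itself tends to keep falling as the rank grows, so it would not show where the results stop improving.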


Tyler Durden


Tyler, thanks for such a good explanation. However, I have one question: are the latent factors we assume only the users' characteristics (choices, interests), or may they include the items' characteristics as well? – hard coder Jun 10 '15 at 05:53

It's purely a characteristic of the data. – Tyler Durden Jun 10 '15 at 11:29
If you could include as precise a response for lambda, which I believe is the only other parameter ALS uses... This is the best answer I have found for rank. – Dan Ciborowski - MSFT Dec 13 '17 at 16:55
The 'rank' controls the number of internal parameters that must be fit from the data. Too many, and you overfit your training set rather than learning something that generalizes. Thus, more is not always better, but as your dataset grows, you may be able to improve by increasing it. – George Forman Apr 23 '20 at 23:03
