I need some suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website, which I paste below:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
RMSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() ** 0.5
print("Root Mean Squared Error = " + str(RMSE))
A good model needs the RMSE to be as small as possible.
Is that because I do not set proper parameters for the ALS.train method, such as rank and numIterations? Or is it because my dataset is small, which makes the RMSE big?
Could anyone help me figure out what causes the big RMSE and how to fix it?
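For example, would something along these lines be the right way to check the parameters? This is only a rough sketch: the rank and lambda_ values are guesses on my part, and computeRmse is just a helper I wrote, not something from MLlib.

from pyspark.mllib.recommendation import ALS

# Hold out part of the data so RMSE is measured on ratings the model has not seen.
training, validation = ratings.randomSplit([0.8, 0.2], seed=42)
validationPairs = validation.map(lambda r: (r[0], r[1]))

def computeRmse(model, data, pairs):
    # Predict for the given (user, product) pairs and compare with the true ratings.
    predictions = model.predictAll(pairs).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = data.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    return ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean() ** 0.5

# Try a few parameter combinations and keep the one with the lowest validation RMSE.
bestParams, bestRmse = None, float("inf")
for rank in [8, 10, 12]:
    for lmbda in [0.01, 0.1, 1.0]:
        model = ALS.train(training, rank, iterations=10, lambda_=lmbda)
        rmse = computeRmse(model, validation, validationPairs)
        if rmse < bestRmse:
            bestParams, bestRmse = (rank, lmbda), rmse
print(bestParams, bestRmse)

Is this the right direction, or is there a better way to choose these parameters?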
Addition:
Just as @eliasah said, I need to add some detail to narrow down the answer set. Let us consider this particular situation:
I want to build a recommendation system that recommends music to my clients. I have their historical ratings for tracks, albums, artists, and genres. Obviously, these four classes form a hierarchical structure: tracks belong directly to albums, albums belong directly to artists, and artists may belong to several different genres. Finally, I want to use all of this information to choose some tracks to recommend to clients.
So, what is the best practice for building a good model in this situation and keeping the prediction RMSE as small as possible?
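To make this concrete, here is roughly how I imagined flattening the hierarchy into plain user-track ratings before calling ALS.train. The file names, column layouts, and the 0.5 weight below are all placeholders, and I do not know whether pushing album (or artist/genre) ratings down to their tracks like this is a sensible approach:

from pyspark.mllib.recommendation import ALS, Rating

# Direct track ratings: ((user_id, track_id), rating), assuming "user_id,track_id,rating" lines
trackRatings = sc.textFile("track_ratings.csv") \
    .map(lambda l: l.split(',')) \
    .map(lambda f: ((int(f[0]), int(f[1])), float(f[2])))

# Track metadata keyed by album: (album_id, track_id), assuming "track_id,album_id" lines
trackToAlbum = sc.textFile("tracks.csv") \
    .map(lambda l: l.split(',')) \
    .map(lambda f: (int(f[1]), int(f[0])))

# Album ratings keyed by album: (album_id, (user_id, rating)), assuming "user_id,album_id,rating" lines
albumRatings = sc.textFile("album_ratings.csv") \
    .map(lambda l: l.split(',')) \
    .map(lambda f: (int(f[1]), (int(f[0]), float(f[2]))))

# Push each album rating down to the album's tracks with a smaller weight.
albumPushedDown = albumRatings.join(trackToAlbum) \
    .map(lambda kv: ((kv[1][0][0], kv[1][1]), 0.5 * kv[1][0][1]))

# Sum the direct and inherited signals per (user, track) and train ALS on the result.
# Artist and genre ratings could be pushed down the same way with even smaller weights.
combined = trackRatings.union(albumPushedDown) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda kv: Rating(kv[0][0], kv[0][1], kv[1]))

model = ALS.train(combined, rank=10, iterations=10, lambda_=0.1)

Is combining the levels into one rating matrix like this reasonable, or should the hierarchy be handled in some other way?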