I can multi-hot encode the movies each user has rated with:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

def multihot_encode(actual_values, ordered_possible_values) -> np.ndarray:
    """Convert a categorical feature with multiple values to a multi-label binary encoding."""
    # Passing `classes` fixes the column order to ordered_possible_values.
    mlb = MultiLabelBinarizer(classes=ordered_possible_values)
    binary_format = mlb.fit_transform(actual_values)
    return binary_format

user_matrix = multihot_encode(lists_of_movieIds, all_movieIds)
where lists_of_movieIds is a list of length batch_size containing variable-length lists of movie IDs (strings), and all_movieIds contains all the possible movie ID strings.
However, instead of just a 1 in the resulting matrix, I want the actual rating that the user gave to the movie. Alongside lists_of_movieIds, I also have access to a "parallel" list_of_ratings with the same shape.
How do I go about doing that efficiently? Is there another MultiLabelBinarizer which takes those as args? Can I do some fancy linear algebra to get there?
I tried to do it like:
user_matrix[user_matrix == 1] = np.concatenate(list_of_ratings)
but the ratings are misplaced because list_of_ratings is not ordered the same way as all_movieIds...
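To make the problem concrete, here is a small reproduction. The movie IDs and ratings are made up for illustration and are not from my real data. It shows where the mask assignment puts the ratings, next to the matrix I actually want, which I build here with a plain (and presumably slow) loop over an ID-to-column lookup:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data, only for illustration.
all_movieIds = ["m1", "m2", "m3"]
lists_of_movieIds = [["m3", "m1"], ["m2"]]
list_of_ratings = [[5.0, 2.0], [4.0]]  # parallel to lists_of_movieIds

mlb = MultiLabelBinarizer(classes=all_movieIds)
# Cast to float so that ratings are not truncated to integers.
user_matrix = mlb.fit_transform(lists_of_movieIds).astype(float)

# The boolean mask is filled in row-major order, i.e. in all_movieIds
# column order within each row, not in the order the user listed the movies.
misplaced = user_matrix.copy()
misplaced[misplaced == 1] = np.concatenate(list_of_ratings)
# misplaced -> [[5., 0., 2.], [0., 4., 0.]]  (m1 wrongly gets 5.0)

# What I want: each rating in the column of its own movie.
col = {movie_id: j for j, movie_id in enumerate(all_movieIds)}
expected = np.zeros_like(user_matrix)
for i, (ids, ratings) in enumerate(zip(lists_of_movieIds, list_of_ratings)):
    for movie_id, rating in zip(ids, ratings):
        expected[i, col[movie_id]] = rating
# expected -> [[2., 0., 5.], [0., 4., 0.]]
```

The two matrices only agree when every user's movie list happens to be sorted in the same order as all_movieIds, which is not the case for my data.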