How to do Multi-hot Encoding but with actual values instead of ones

Question

I am able to perform a Multi-hot encoding of ratings to movies by:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer


def multihot_encode(actual_values, ordered_possible_values) -> np.ndarray:
    """ Converts a categorical feature with multiple values to a multi-label binary encoding """
    mlb = MultiLabelBinarizer(classes=ordered_possible_values)
    binary_format = mlb.fit_transform(actual_values)
    return binary_format

user_matrix = multihot_encode(lists_of_movieIds, all_movieIds)

where lists_of_movieIds is a list of length batch_size containing variable-length lists of movie IDs (strings), and all_movieIds holds all the possible movie ID strings.

However, instead of just a 1 in the resulting matrix, I want the actual rating that the user gave to the movie. Alongside lists_of_movieIds I also have a parallel list_of_ratings, with one rating per movie ID.

How do I go about doing that efficiently? Is there another MultiLabelBinarizer which takes those as args? Can I do some fancy linear algebra to get there?

I tried to do it like:

user_matrix[user_matrix == 1] = np.concatenate(list_of_ratings)

but the ratings are misplaced because list_of_ratings is not ordered the same way as all_movieIds...
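
The misplacement comes from how boolean-mask assignment works: the masked cells are filled in row-major order, so within each user the values have to arrive sorted by the movie's column in all_movieIds. A minimal sketch of reordering the flattened ratings to match, assuming made-up movie IDs and a hard-coded stand-in for the binarizer output (the names col_of, rows, cols and order are illustrative, not from the original code):

```python
import numpy as np

# Made-up example data; the IDs and ratings are assumptions for illustration.
all_movieIds = ["m1", "m2", "m3", "m4"]
lists_of_movieIds = [["m4", "m3"], ["m1"]]
list_of_ratings = [[4.0, 3.0], [5.0]]

# Stand-in for the multi-hot matrix (columns follow all_movieIds).
# Cast to float so that fractional ratings are not truncated.
user_matrix = np.array([[0, 0, 1, 1],
                        [1, 0, 0, 0]], dtype=float)

# Column index of every movie ID.
col_of = {m: j for j, m in enumerate(all_movieIds)}

# Flatten the ragged lists into parallel row / column / rating arrays.
lengths = [len(ids) for ids in lists_of_movieIds]
rows = np.repeat(np.arange(len(lists_of_movieIds)), lengths)
cols = np.array([col_of[m] for ids in lists_of_movieIds for m in ids])
ratings = np.concatenate(list_of_ratings)

# Sort by row first, then by column (np.lexsort treats the LAST key as primary),
# which is the order in which the boolean mask visits the cells.
order = np.lexsort((cols, rows))
user_matrix[user_matrix == 1] = ratings[order]

print(user_matrix)
```

This still relies on every (user, movie) pair appearing exactly once, so that the number of masked cells equals the number of ratings.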

score 1 · Answer 1 · answered Nov 02 '21 at 11:20


Without using MultiLabelBinarizer

import numpy as np

classes = ['comedy', 'xyz', 'thriller', 'sci-fi']
# Map each class to its column index
id_dict = {c: i for i, c in enumerate(classes)}
# Use lists, not sets: set iteration order is arbitrary,
# so ratings could be paired with the wrong movie
lists_of_movieIds = [['sci-fi', 'thriller'], ['comedy']]
list_of_ratings = [[4, 3], [5]]

data = np.zeros((len(lists_of_movieIds), len(classes)))
for i, (m_ids, rs) in enumerate(zip(lists_of_movieIds, list_of_ratings)):
    for m_id, r in zip(m_ids, rs):
        data[i, id_dict[m_id]] = r

print(data)

Output:

[[0. 0. 3. 4.]
 [5. 0. 0. 0.]]
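
If the nested Python loops become a bottleneck, the per-cell assignment can also be written as a single NumPy integer-array indexing step. This is a rough sketch rather than a benchmarked solution; rows, cols and ratings are illustrative names, and flattening the ragged lists still takes one comprehension:

```python
import numpy as np

classes = ['comedy', 'xyz', 'thriller', 'sci-fi']
id_dict = {c: i for i, c in enumerate(classes)}
lists_of_movieIds = [['sci-fi', 'thriller'], ['comedy']]
list_of_ratings = [[4, 3], [5]]

# Row index of each rating: user i repeated once per movie they rated.
lengths = [len(m_ids) for m_ids in lists_of_movieIds]
rows = np.repeat(np.arange(len(lists_of_movieIds)), lengths)

# Column index of each rating, in the same flattened order.
cols = np.array([id_dict[m_id] for m_ids in lists_of_movieIds for m_id in m_ids])

# Flattened ratings, aligned element-wise with rows and cols.
ratings = np.concatenate(list_of_ratings)

# One vectorized assignment instead of a per-cell loop.
data = np.zeros((len(lists_of_movieIds), len(classes)))
data[rows, cols] = ratings

print(data)
```

Because each rating is written to an explicit (row, column) position, the result does not depend on the order of the movies within a user's list.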

answered Nov 02 '21 at 11:20

mujjiga


Thank you for the answer, but the point was to avoid Python for-loops and use vectorized operations, because I believe loops would slow things down. – Michael Nov 02 '21 at 13:42
I ended up sorting my dataset on movieId when creating it and using the masking approach I mentioned in the post as a temporary (probably permanent if I don't find anything better) and risky solution. – Michael Nov 02 '21 at 13:43
