I am building a recommender system, so I have lists of users, items, and ratings, and I need to assign each user and each item a categorical (integer) ID. There are approximately 100,000 users and 10,000 items, with roughly 1 million ratings. My question is: which method is the most scalable?
I think I have 3 options:
- Using sklearn's preprocessing.LabelEncoder()
- Using pandas: df['items'].astype('category').cat.codes.values
- Using something like a dictionary that I can write back to the dataframe, such as:
items = item_reviews.items.unique()
# integer ID -> item
items_map = {i:val for i,val in enumerate(items)}
# item -> integer ID
inverse_items_map = {val:i for i,val in enumerate(items)}
All three should give an equivalent encoding, since each labels items from 0 to n_items - 1 for my user and item vectors, although the specific integer assigned to a given item can differ between methods. (Note: there are actually more categories, such as manufacturer, country of origin, color, etc., that will also be used in the model.)
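To make the comparison concrete, here is a minimal sketch of the three options on a toy dataframe. The data and the name item_to_id are made up for illustration; only item_reviews and the items column come from my snippet above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real ratings table.
item_reviews = pd.DataFrame({
    "users": ["u3", "u1", "u2", "u1", "u3"],
    "items": ["pear", "apple", "pear", "kiwi", "apple"],
    "rating": [4, 5, 3, 2, 5],
})

# Option 1: sklearn LabelEncoder (codes follow the sorted unique values).
encoder = LabelEncoder()
sk_codes = encoder.fit_transform(item_reviews["items"])

# Option 2: pandas categorical codes (categories are sorted here as well).
cat_codes = item_reviews["items"].astype("category").cat.codes.values

# Option 3: dictionary built from unique() (codes follow order of first appearance).
items = item_reviews["items"].unique()
item_to_id = {val: i for i, val in enumerate(items)}  # hypothetical name
dict_codes = item_reviews["items"].map(item_to_id).values

n_items = len(items)
```

On this toy data, options 1 and 2 assign identical codes, while option 3 covers the same range 0 to n_items - 1 but in a different order, so the mappings should not be mixed between methods.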
What I am building at present is a proof of concept, but it will be scaled to a database with over 1.5 million users, 200k items, and 6 million ratings, so I need to make sure that I am not wasting memory or doing unnecessary computation.
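For the memory side, this is a rough sketch of how I would check the footprint of the encoded column. It assumes synthetic integer item identifiers (raw_items is a made-up stand-in for the real column) and only compares the pandas category codes against the same codes stored as int64.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data at the proof-of-concept scale.
rng = np.random.default_rng(0)
n_ratings, n_items = 1_000_000, 10_000
raw_items = pd.Series(rng.integers(0, n_items, size=n_ratings) * 7 + 1000)

# pandas picks the smallest signed integer dtype that can hold the codes.
codes = raw_items.astype("category").cat.codes
codes_dtype = codes.dtype

# Bytes used by the compact codes versus the same codes widened to int64.
compact_bytes = codes.to_numpy().nbytes
int64_bytes = codes.astype("int64").to_numpy().nbytes

print(codes_dtype, compact_bytes, int64_bytes)
```

With about 10,000 distinct items the codes fit in a 16-bit integer, so the encoded column takes a quarter of the space of an int64 column of the same length. At 200k items the codes would no longer fit in 16 bits.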