
I am building a recommender system, so I have lists of users, items, and ratings, and I need to assign each user and item a categorical ID. There are approximately 100,000 users and 10,000 items, with roughly 1 million ratings. Which method is the most scalable?

I think I have 3 options:

  1. Using sklearn's preprocessing.LabelEncoder()
  2. Using pandas df['items'].astype('category').cat.codes.values
  3. Using something like a dictionary that I can write back to the dataframe

such as

items = item_reviews['items'].unique()  # bracket access: .items is a DataFrame method, not the column
items_map = {i: val for i, val in enumerate(items)}  # ID -> item
inverse_items_map = {val: i for i, val in enumerate(items)}  # item -> ID

All three should give equivalent results, since each labels the values from 0 to n_items - 1 for my user and item vectors. (Note: there are actually more categorical fields, such as manufacturer, country of origin, and color, that will also be used in the model.)
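One caveat on the three options above: they all produce IDs in the range 0 to n_items - 1, but not necessarily the same ID for the same item. A minimal sketch, assuming a toy dataframe (the data here is hypothetical): pandas categories are sorted, while a dict built from unique() follows order of first appearance. np.unique with return_inverse is used here as a stand-in for a sorted-label encoding, which is the ordering LabelEncoder also uses for its classes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the item_reviews dataframe.
item_reviews = pd.DataFrame({"items": ["pear", "apple", "pear", "fig", "apple"]})

# Option 2: pandas categorical codes (categories are sorted: apple, fig, pear).
cat_codes = item_reviews["items"].astype("category").cat.codes.to_numpy()

# Sorted-unique encoding with NumPy, used here in place of LabelEncoder
# so the sketch needs no scikit-learn install.
uniques, np_codes = np.unique(item_reviews["items"].to_numpy(), return_inverse=True)

# Option 3: dict built from unique(), which keeps order of first appearance.
items = item_reviews["items"].unique()
inverse_items_map = {val: i for i, val in enumerate(items)}
dict_codes = item_reviews["items"].map(inverse_items_map).to_numpy()

print(cat_codes.tolist())   # sorted-category IDs
print(np_codes.tolist())    # same IDs as the categorical codes
print(dict_codes.tolist())  # first-appearance IDs, a different assignment
```

Either assignment works for a model as long as the same mapping is stored and reused everywhere the IDs are consumed.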

What I am building at present is a proof of concept, but it will be scaled to a database with over 1.5 million users, 200k items, and 6 million ratings, so I need to make sure that I am not wasting memory or doing unneeded calculations.

user1563247

1 Answer


I think the pandas category dtype is your best option, since it uses a hash table to build the codes. See https://stackoverflow.com/a/39503973/4633341 for some timing tests.
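As a rough sketch of how the category approach could be used in practice (the dataframe and column names below are hypothetical), the codes can be written back as an ID column, and the categories index serves as the ID-to-item lookup, so no separate dictionary is needed:

```python
import pandas as pd

# Hypothetical toy data; a real ratings table would be loaded from the DB.
df = pd.DataFrame({"items": ["pear", "apple", "pear", "fig", "apple"]})

# Convert once and keep the categorical around for the reverse lookup.
item_cat = df["items"].astype("category")
df["item_id"] = item_cat.cat.codes

# The categories index maps ID -> item: position i holds the label for ID i.
id_to_item = item_cat.cat.categories

# Round trip: recover the original labels from the integer IDs.
recovered = id_to_item.take(df["item_id"].to_numpy())
n_items = len(id_to_item)
```

The same pattern would apply to the user column and to the other categorical fields mentioned in the question.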