LabelEncoder vs. Pandas categorical vs. enumerate?

Question

I am building a recommender system, so I have lists of users, items, and ratings, and I need to assign each user and item a categorical ID. There are approximately 100,000 users and 10,000 items, with roughly 1 million ratings. Which method is the most scalable?

I think I have 3 options:

Using sklearn's preprocessing.LabelEncoder()
Using pandas df['items'].astype('category').cat.codes.values
Using something like a dictionary that I can write back to the dataframe

such as

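# Index the column by name: attribute access (.items) returns the DataFrame.items method
items = item_reviews['items'].unique()
items_map = {i: val for i, val in enumerate(items)}          # ID -> item
inverse_items_map = {val: i for i, val in enumerate(items)}  # item -> ID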

All should give the same result, since each will label items from 0 to n_items - 1 for my user and item vectors. (Note: there are actually more categories, such as manufacturer, country of origin, color, etc., that will also be used in the model.)
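For illustration, here is a minimal sketch of the three options side by side. The toy `item_reviews` frame and its values are made up for the example. Each option produces IDs in the range 0 to n_items - 1, but the specific ID given to each item can differ: LabelEncoder and the categorical codes follow sorted order, while the dictionary built from unique() follows order of first appearance.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real item_reviews frame
item_reviews = pd.DataFrame({"items": ["pear", "apple", "pear", "kiwi", "apple"]})

# Option 1: sklearn LabelEncoder (classes are stored in sorted order)
le = LabelEncoder()
codes_le = le.fit_transform(item_reviews["items"])

# Option 2: pandas categorical codes (categories are sorted for orderable values)
codes_cat = item_reviews["items"].astype("category").cat.codes.values

# Option 3: dictionary built from unique() (IDs follow order of first appearance)
items = item_reviews["items"].unique()
inverse_items_map = {val: i for i, val in enumerate(items)}
codes_dict = item_reviews["items"].map(inverse_items_map).values

print(codes_le, codes_cat, codes_dict)
```

If the IDs need to match across methods, or across separate runs, the mapping itself (le.classes_, the categories index, or the dictionary) has to be stored and reused rather than rebuilt.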

What I am building at present is a proof of concept, but it will be scaled to a DB that has over 1.5 million users, 200k items, and 6 million ratings, so I need to make sure that I am not wasting memory or doing unneeded calculations.

score 2 · Accepted Answer · answered Dec 14 '18 at 22:28

I think pandas category is your best option, since it uses a hash table. See https://stackoverflow.com/a/39503973/4633341 for some timing tests.
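To check the memory side on your own data, one option is to compare the footprint of the column before and after converting it to category. This is only a rough sketch: the label values and sizes below are invented, and the actual savings will depend on your data and pandas version.

```python
import pandas as pd

# Hypothetical data: 1,000 distinct item labels repeated to 100,000 rows
labels = [f"item_{i}" for i in range(1_000)]
df = pd.DataFrame({"items": labels * 100})

# Memory used by the raw string column, including the string payloads
as_object_bytes = df["items"].memory_usage(deep=True)

# Convert once; the column then stores small integer codes plus one lookup table
df["items"] = df["items"].astype("category")
as_category_bytes = df["items"].memory_usage(deep=True)

# Integer IDs for the model, and the lookup table to map IDs back to labels
item_ids = df["items"].cat.codes.to_numpy()
id_to_item = df["items"].cat.categories

print(as_object_bytes, as_category_bytes)
```

Keeping the column as category also means the mapping back to the original labels (id_to_item here) travels with the dataframe, so there is no separate dictionary to maintain.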

ignacio pacheco
