0

I'm trying to create a dataframe where for each user there will be a list of all movies with binary classification, whether the user has seen this movie or not. RN I'm using MovieLens small (ZIP) and trying to get result like this:

   1  2  3  4  5  6  etc
1  0  0  1  1  0  1  etc
2  1  0  0  0  1  1  etc
3  0  1  1  0  0  1  etc

Original DataFrames are like:

    movieId  title 
0   1        Toy Story (1995)    
1   2        Jumanji (1995)
2   3        Grumpier Old Men (1995)    
3   4        Waiting to Exhale (1995)   
    userId  movieId     rating  timestamp
0   1       1           4.0     964982703
1   1       3           4.0     964981247
2   1       6           4.0     964982224
3   1       47          5.0     964983815
4   1       50          5.0     964982931

Where indexes are user IDs and columns are movie IDs. I tried solving this problem using list comprehensions like this:

pd.DataFrame(data=[[1 if movie_id in ratings_df[ratings_df["userId"] == user_id]["movieId"] else 0 for movie_id in tqdm(range(1, last_movie + 1))] for user_id in range(1, last_user + 1)], columns=movie_columns) 

But this is working way too slow.

Corralien
  • 109,409
  • 8
  • 28
  • 52
OverFitter
  • 49
  • 8

1 Answers1

1

Use pd.crosstab:

For this sample:

       userId  movieId  rating   timestamp
85255     554     1947     5.0   944900238
21229     140     1262     5.0   942841394
40791     275     2580     4.0  1049078929
73989     474     3135     4.0  1115125983
73989     474     3135     4.0  1115125983  # dupe
22434     153      520     2.0  1525552333
22434     153      520     2.0  1525552333  # dupe

You can use:

>>> pd.crosstab(df['userId'], df['movieId']).astype(bool).astype(int)

movieId  520   1262  1947  2580  3135
userId
140         0     1     0     0     0
153         1     0     0     0     0
275         0     0     0     1     0
474         0     0     0     0     1
554         0     0     1     0     0

Just for information on the entire small dataset ([100836 rows x 4 columns])

# Solution proposed from duplicated link
>>> %timeit pd.get_dummies(df.set_index('userId')['movieId'])
9.73 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Proposed solution here
>>> %timeit pd.crosstab(df['userId'], df['movieId']).astype(bool).astype(int)
5.95 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Corralien
  • 109,409
  • 8
  • 28
  • 52