Check if user has seen the film in MovieLens dataframe

Question

I'm trying to create a dataframe where for each user there will be a list of all movies with binary classification, whether the user has seen this movie or not. RN I'm using MovieLens small (ZIP) and trying to get result like this:

   1  2  3  4  5  6  etc
1  0  0  1  1  0  1  etc
2  1  0  0  0  1  1  etc
3  0  1  1  0  0  1  etc

Original DataFrames are like:

    movieId  title 
0   1        Toy Story (1995)    
1   2        Jumanji (1995)
2   3        Grumpier Old Men (1995)    
3   4        Waiting to Exhale (1995)

    userId  movieId     rating  timestamp
0   1       1           4.0     964982703
1   1       3           4.0     964981247
2   1       6           4.0     964982224
3   1       47          5.0     964983815
4   1       50          5.0     964982931

Where indexes are user IDs and columns are movie IDs. I tried solving this problem using list comprehensions like this:

pd.DataFrame(data=[[1 if movie_id in ratings_df[ratings_df["userId"] == user_id]["movieId"] else 0 for movie_id in tqdm(range(1, last_movie + 1))] for user_id in range(1, last_user + 1)], columns=movie_columns)

But this is working way too slow.

Without a sample of your original dataframe it's hard to get help. — Corralien, Aug 23 '21 at 12:15
Thx, added samples. Btw full frames can be found [here](https://grouplens.org/datasets/movielens/) — OverFitter, Aug 23 '21 at 12:22

Corralien · Accepted Answer · 2021-08-23T13:02:43.483

Use pd.crosstab:

For this sample:

       userId  movieId  rating   timestamp
85255     554     1947     5.0   944900238
21229     140     1262     5.0   942841394
40791     275     2580     4.0  1049078929
73989     474     3135     4.0  1115125983
73989     474     3135     4.0  1115125983  # dupe
22434     153      520     2.0  1525552333
22434     153      520     2.0  1525552333  # dupe

You can use:

>>> pd.crosstab(df['userId'], df['movieId']).astype(bool).astype(int)

movieId  520   1262  1947  2580  3135
userId
140         0     1     0     0     0
153         1     0     0     0     0
275         0     0     0     1     0
474         0     0     0     0     1
554         0     0     1     0     0

Just for information on the entire small dataset ([100836 rows x 4 columns])

# Solution proposed from duplicated link
>>> %timeit pd.get_dummies(df.set_index('userId')['movieId'])
9.73 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Proposed solution here
>>> %timeit pd.crosstab(df['userId'], df['movieId']).astype(bool).astype(int)
5.95 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Well, it works, but for some reason only for 64 users – OverFitter Aug 23 '21 at 12:41 — OverFitter, Aug 23 '21 at 12:41
@Borka. I found a first workaround. – Corralien Aug 23 '21 at 12:43 — Corralien, Aug 23 '21 at 12:43

Check if user has seen the film in MovieLens dataframe

1 Answers1