
I use the pd.pivot_table() method to create a user-item matrix by pivoting the user-item activity data. However, the dataframe is so large that I get an error like this:

Unstacked DataFrame is too big, causing int32 overflow

Any suggestions on solving this problem? Thanks!

r_matrix = df.pivot_table(values='rating', index='userId', columns='movieId')
JoFox
  • It would be very helpful as well, if you could provide a [MWE](https://stackoverflow.com/help/minimal-reproducible-example) – Qw3ry Jun 27 '19 at 11:57
  • Are you aware what an `int32`-overflow is? It looks like you have to use another data structure if you want to handle data of that size. – Qw3ry Jun 27 '19 at 13:46
  • Not really, to be honest. Do you have any suggestions? Would numpy be better than pandas? – JoFox Jun 27 '19 at 14:39
  • So: an overflow occurs if you put something in a container, but this "something" is too big for the container. In your case, something inside pandas tries to write a number into a variable, but the number is too large to fit into the variable (which is an `int32` in this case). For further information on (integer) overflows, just google a bit. – Qw3ry Jun 27 '19 at 15:23

5 Answers


Some Solutions:

  • You can downgrade your pandas version to 0.21, which has no problem pivoting large datasets.
  • You can restructure your data into a dictionary instead, e.g. `df.groupby('EVENT_ID')['DIAGNOSIS'].apply(list).to_dict()`
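The dictionary approach above avoids the wide pivot entirely. A minimal sketch of the same idea, rewritten with the question's column names (`userId`, `rating`) instead of the answerer's `EVENT_ID`/`DIAGNOSIS`, and using hypothetical toy data:

```python
import pandas as pd

# Hypothetical ratings data using the question's column names.
df = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating': [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Collect each user's ratings into a plain dict instead of a wide matrix;
# this never materializes the huge unstacked DataFrame.
ratings_by_user = df.groupby('userId')['rating'].apply(list).to_dict()
print(ratings_by_user)  # {1: [4.0, 3.5], 2: [5.0, 2.0], 3: [4.5]}
```
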
dasmehdix

You can use groupby instead. Try this code:

reviews.groupby(['userId','movieId'])['rating'].max().unstack()
Hamid

An integer overflow inside library code is not something you can do much about. You basically have three options:

  1. Change the input data you provide to the library so the overflow does not occur. You probably need to make the input smaller in some sense. If that does not help, you may be using the library in the wrong way or hitting a bug in the library.
  2. Use a different library (or none at all); it seems that the library you are using is not intended to operate on input this large.
  3. Modify the code of the library itself so it can handle your input. This may be hard to do, but if you submit a pull request to the library's source code, many people will benefit from it.

You don't provide much code, so I cannot tell which solution is best for you.
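One concrete way to follow option 2 is to skip the dense pivot and store only the observed ratings in a `scipy.sparse` matrix. A minimal sketch with hypothetical toy data, assuming the question's `userId`/`movieId`/`rating` columns:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings frame with the question's columns.
df = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Map ids to contiguous row/column positions, then build a sparse matrix.
# Only the observed ratings are stored, so no giant dense array is allocated.
user_codes, user_index = pd.factorize(df['userId'])
movie_codes, movie_index = pd.factorize(df['movieId'])
r_matrix = csr_matrix(
    (df['rating'].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)
print(r_matrix.shape)  # (3, 3)
```

`user_index` and `movie_index` keep the mapping from matrix rows/columns back to the original ids, which most recommender libraries that accept sparse input expect you to track yourself.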

Qw3ry

If you want movieId as your columns, first sort the dataframe using movieId as the key.

Then split the dataframe in half, choosing the split point n so that each subset contains all the ratings for any given movie (no movie's rows span both halves).

subset1 = df[:n] 
subset2 = df[n:]

Now, apply pivot_table to each of the subsets:

matrix1 = subset1.pivot_table(values='rating', index='userId', columns='movieId')
matrix2 = subset2.pivot_table(values='rating', index='userId', columns='movieId')

Finally, join matrix1 and matrix2 using:

complete_matrix = matrix1.join(matrix2)

On the other hand, if you want userId as your columns, sort the dataframe using userId as the key and repeat the above process.

**Note:** please be sure to delete subset1, subset2, matrix1 and matrix2 after you're done, or else you'll end up with a MemoryError.
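The steps above can be sketched end to end on toy data (hypothetical values; the split point n is chosen by hand here to fall on a movieId boundary):

```python
import pandas as pd

# Small stand-in for the ratings frame, sorted by movieId so a split point
# can be chosen on a movie boundary (no movie's rows span both halves).
df = pd.DataFrame({
    'userId':  [1, 2, 1, 3, 2, 3],
    'movieId': [10, 10, 20, 20, 30, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
}).sort_values('movieId', kind='stable', ignore_index=True)

n = 4  # boundary between movieId 20 and movieId 30 in this toy data
subset1, subset2 = df[:n], df[n:]

matrix1 = subset1.pivot_table(values='rating', index='userId', columns='movieId')
matrix2 = subset2.pivot_table(values='rating', index='userId', columns='movieId')

# join aligns on userId; the column sets are disjoint because of the boundary split
complete_matrix = matrix1.join(matrix2, how='outer')
print(complete_matrix.shape)  # (3, 3)
```

In practice you would pick n programmatically (e.g. the first row index where movieId changes near the midpoint) rather than hard-coding it.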

Uchiha012
  • was stuck with the same issue - this is the only way to make it work – Rajarshi Ghosh Mar 29 '22 at 16:35
  • @RajarshiGhosh Hi, I'm using this method now, but whether in kaggle or google colab(GPU), they all don't work. Kaggle shows "Your notebook tried to allocate more memory than is available. It has restarted." Google colab shows "Your session crashed after using all available RAM." Did you have the same problem? – resssslll Nov 20 '22 at 05:54

Converting the values column should resolve your issue:

df['ratings'] = df['ratings'].astype('int64')

Infamouse