
I have a table with 140 million records.

The data is already aggregated and looks like this:

filename code frequency
1054968  A837 3
1054968  F939 2
9899223  W821 8
3775859  A837 5
..
..
..

I want to pivot the data so that it looks like this:

filename  A837  ...  F939 ...  W821 ...
1054968    3          2         0
9899223    0          0         8
3775859    5          0         0

I use this method:

df_pivot = df_features.pivot(index='filename', columns='code', values='frequency')

It works fine with around 100,000 records, but when I reach 1M records I get this error:

    "Unstacked DataFrame is too big, " "causing int32 overflow"

ValueError: Unstacked DataFrame is too big, causing int32 overflow

How can I do that pivot? (total number of columns after pivot should be around 36,000)
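
As a rough sketch of where the limit comes from: assuming the check behind this error compares the number of cells in the dense result (rows × columns) against the int32 maximum, the 36,000 expected columns leave room for only a modest number of distinct filenames. The helper `fits_in_int32` is hypothetical and only illustrates the arithmetic; it is not a pandas function.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def fits_in_int32(n_rows: int, n_cols: int) -> bool:
    """Return True if a dense n_rows x n_cols result stays within the int32 cell limit."""
    return n_rows * n_cols <= INT32_MAX


n_cols = 36_000  # expected number of distinct codes after the pivot
max_rows = INT32_MAX // n_cols  # most distinct filenames that still fit

print(max_rows)
print(fits_in_int32(max_rows, n_cols))
print(fits_in_int32(max_rows + 1, n_cols))
```

Under that assumption, a dense pivot of the full table cannot fit regardless of memory, so the result would have to be sparse or split into pieces.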

asmgx
  • There are some answers to this question already [here](https://stackoverflow.com/questions/56790261/pandas-int32-overflow-cant-bulid-a-pivot-table). One option (which falls under the "use a different library" category) is https://dask.org/. – Chris Apr 09 '20 at 07:19
  • It is a little difficult for pandas to handle a dataset this big. You can try Dask or Datatable from H2O.ai. – Subbu VidyaSekar Apr 09 '20 at 07:29
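
Besides the libraries suggested in the comments, one possible direction is to skip the dense pivot and build a sparse matrix instead, since most filename/code combinations are zero. The following is a minimal sketch on toy data, assuming SciPy is available and that the frequencies are numeric. The column names come from the question; `row_labels`, `col_labels` and `matrix` are illustrative names.

```python
import pandas as pd
from scipy import sparse

# Toy long-format data with the same columns as in the question.
df_features = pd.DataFrame({
    "filename": [1054968, 1054968, 9899223, 3775859],
    "code": ["A837", "F939", "W821", "A837"],
    "frequency": [3, 2, 8, 5],
})

# Map each filename and each code to a compact integer position.
row_codes, row_labels = pd.factorize(df_features["filename"])
col_codes, col_labels = pd.factorize(df_features["code"], sort=True)

# Build the sparse matrix: only the non-zero frequencies are stored.
# Duplicate (filename, code) pairs, if any, are summed on conversion to CSR.
matrix = sparse.coo_matrix(
    (df_features["frequency"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

# Optional: wrap the matrix in a DataFrame backed by sparse columns.
df_pivot = pd.DataFrame.sparse.from_spmatrix(
    matrix, index=row_labels, columns=col_labels
)
print(df_pivot)
```

With around 36,000 columns, wrapping the matrix in a DataFrame may itself be slow, so it may be preferable to work with `matrix` directly and keep `row_labels` and `col_labels` for lookups.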
