
I use NumPy broadcasting to compute a difference matrix from a pandas DataFrame. With a large DataFrame, it fails with the error "'bool' object has no attribute 'sum'". With a small DataFrame, it runs fine.

I have posted the two CSV files at the following links: large file, small file

import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index) 
print(matrix)

When I run this, I get the difference matrix.
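For readers unfamiliar with the broadcasting step, here is a minimal sketch of what the expression does. The array `a` is a made-up stand-in for `a_small`, not data from the posted files.

```python
import numpy as np

# Toy stand-in for a_small: 3 rows, 2 columns (values are made up).
a = np.array([[1, 2],
              [1, 3],
              [4, 3]])

# a has shape (3, 2) and a[:, None] has shape (3, 1, 2).
# Broadcasting compares every row with every other row,
# giving a boolean array of shape (3, 3, 2).
neq = a != a[:, None]

# Summing over the last axis counts the differing columns per row pair.
diff = neq.sum(2)

print(neq.shape)
print(diff)
```

The intermediate boolean array holds one entry per (row, row, column) triple, so its size grows with the square of the number of rows.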

When I switch to the large file, it reports the following error. Does anybody know why this happens?

EDIT: The NumPy version is 1.19.5.

np.__version__
'1.19.5'


hpaulj
崔箐坡
  • What are the exact shapes of the big and small arrays when you try to do the element-wise comparison? – Kevin Mar 25 '21 at 02:25
  • The shape of the large array is (8599, 3002); the shape of the small array is (19, 7). – 崔箐坡 Mar 25 '21 at 02:32
  • OK, I just tried creating an array with ```np.random.randn(8599, 3002)``` and can see that the comparison returns a single bool instead of doing the element-wise comparison, which is why it throws the error. If you run ```np.not_equal(a_small, a_small[:, None])``` you will get a more transparent error message: numpy.core._exceptions._ArrayMemoryError: Unable to allocate 207. GiB for an array with shape (8599, 8599, 3002) and data type bool. – Kevin Mar 25 '21 at 02:38
  • Thanks for your reply, Kevin. I want to get the difference matrix of the large DataFrame. Do you have a good idea for how to do that? – 崔箐坡 Mar 25 '21 at 02:51
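One possible way around the memory limit described in the comments is to compare a block of rows at a time, so the boolean intermediate never reaches the full (n, n, m) shape. The sketch below is only an illustration, not a tested solution for the posted files: the function name `pairwise_diff_counts`, the `chunk` parameter, and the random test array are hypothetical, and it assumes a numeric array with NaNs already filled as in the question.

```python
import numpy as np

def pairwise_diff_counts(a, chunk=16):
    """Count differing columns for every pair of rows, one block of rows at a time."""
    n = a.shape[0]
    out = np.empty((n, n), dtype=np.int32)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # The boolean block has shape (stop - start, n, m) instead of (n, n, m).
        block = a[start:stop, None, :] != a[None, :, :]
        out[start:stop] = block.sum(axis=2)
    return out

# Size of the full intermediate for the shapes quoted in the comments
# (one byte per bool), which is what triggers the allocation error.
n, m = 8599, 3002
full_gib = n * n * m / 2**30
print(f"full intermediate: {full_gib:.0f} GiB")

# Check against the one-shot broadcast on a small made-up array.
rng = np.random.default_rng(0)
small = rng.integers(0, 3, size=(50, 7))
expected = (small != small[:, None]).sum(2)
result = pairwise_diff_counts(small, chunk=8)
print(np.array_equal(result, expected))
```

With `chunk=16` and the large shapes above, each block needs roughly 16 × 8599 × 3002 bytes (about 0.4 GiB), and the (8599, 8599) int32 result needs about 0.3 GiB. A smaller `chunk` lowers peak memory at the cost of more loop iterations. The result can be wrapped in `pd.DataFrame(..., index=df.index, columns=df.index)` as in the question.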

0 Answers