I have a pandas DataFrame with the following shape:

df.shape
(1664599, 3935)

that basically looks like:

import numpy as np
import pandas as pd

rws = ["user1","user2","user3","user4","user5","user6","user7","user8"]
cols = ["prod1","prod2","prod3","prod4","prod5"]

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.3, size=(len(rws), len(cols))), columns=cols, index=rws)

       prod1  prod2  prod3  prod4  prod5
user1      0      1      0      0      0
user2      0      0      1      1      0
user3      1      0      0      1      0
user4      0      0      1      1      1
user5      1      1      0      1      0
user6      0      0      1      0      0
user7      0      1      0      0      0
user8      0      0      0      1      0

For each user, I want to calculate the k nearest neighbors using the Jaccard metric.
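
To make the metric concrete: the Jaccard distance between two users is one minus the number of products both have, divided by the number of products at least one of them has. Below is a minimal sketch on rows copied from the table above. The helper name `jaccard_distance` is hypothetical, and treating two all-zero rows as distance 0 is an assumed convention, not something the setup above dictates.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumed convention: two empty rows are treated as identical.
        return 0.0
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union

# Rows copied from the sample frame shown above.
user1 = [0, 1, 0, 0, 0]
user2 = [0, 0, 1, 1, 0]
user4 = [0, 0, 1, 1, 1]
user7 = [0, 1, 0, 0, 0]

d_17 = jaccard_distance(user1, user7)  # identical rows
d_24 = jaccard_distance(user2, user4)  # 2 shared products out of 3 in the union
d_12 = jaccard_distance(user1, user2)  # no shared products
print(d_17, d_24, d_12)
```

Identical rows such as user1 and user7 are at distance 0, while rows with no product in common are at distance 1.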

I tried to achieve this with the following approach:

from sklearn.neighbors import BallTree

# Build the tree on the full frame, then query every row for its 10 nearest neighbors
ballt = BallTree(df, leaf_size=30, metric='jaccard')
distances, neighbors = ballt.query(df, k=10)

But my 60 GB of memory is not enough, and my Python kernel eventually crashes.
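
For scale, here is a rough estimate of the dense footprint of a frame with the shape reported above. This is only a back-of-the-envelope sketch: it assumes the values are stored as 8-byte integers or floats, ignores index and tree overhead, and the helper name `dense_gib` is hypothetical. As far as I can tell, `BallTree` also works on a float64 copy of its input, which would add a second array of the same size.

```python
import numpy as np

# Shape reported by df.shape in the question.
n_rows, n_cols = 1664599, 3935

def dense_gib(dtype):
    """Size in GiB of a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 2**30

int64_gib = dense_gib(np.int64)      # assumed storage dtype of the frame
float64_gib = dense_gib(np.float64)  # dtype of the copy a tree would likely hold
bool_gib = dense_gib(np.bool_)       # same data stored as 1-byte booleans

print(f"int64:   {int64_gib:.1f} GiB")
print(f"float64: {float64_gib:.1f} GiB")
print(f"bool:    {bool_gib:.1f} GiB")
```

Under these assumptions a single 8-byte dense copy already takes most of the 60 GB, before any neighbor search starts.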

How can I perform this calculation?

rambutan
  • Could you try it with a greater `leaf_size` such as `300`? – yudhiesh Oct 07 '21 at 13:31
  • Increasing `leaf_size` decreases performance: https://stackoverflow.com/questions/65003877/understanding-leafsize-in-scipy-spatial-kdtree . I already tried different sizes; it does not change the out-of-memory problem. – rambutan Oct 07 '21 at 14:39
