I have a pandas DataFrame with the following shape:
df.shape
(1664599, 3935)
It basically looks like this (a small reproducible example):
import numpy as np
import pandas as pd

rws = ["user1","user2","user3","user4","user5","user6","user7","user8"]
cols = ["prod1","prod2","prod3","prod4","prod5"]
np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.3, size=(len(rws), len(cols))), columns=cols, index=rws)
       prod1  prod2  prod3  prod4  prod5
user1      0      1      0      0      0
user2      0      0      1      1      0
user3      1      0      0      1      0
user4      0      0      1      1      1
user5      1      1      0      1      0
user6      0      0      1      0      0
user7      0      1      0      0      0
user8      0      0      0      1      0
For each user, I want to find the k nearest neighbors using the Jaccard metric.
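To be explicit about the metric: by Jaccard distance between two users I mean 1 - |A ∩ B| / |A ∪ B|, where A and B are the sets of products each user has. A minimal sketch on rows from the toy example above (the helper name `jaccard_distance` is my own, only for illustration, and treating two all-zero rows as distance 0 is a convention I am assuming):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two 0/1 vectors: 1 - |A ∩ B| / |A ∪ B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # products both users have
    either = sum(1 for x, y in zip(a, b) if x or y)   # products at least one user has
    if either == 0:
        # Both rows are all zeros; assumed convention: treat them as identical.
        return 0.0
    return 1.0 - both / either

# Rows copied from the toy DataFrame above.
user1 = [0, 1, 0, 0, 0]
user5 = [1, 1, 0, 1, 0]
user7 = [0, 1, 0, 0, 0]

d15 = jaccard_distance(user1, user5)  # share prod2; union is prod1, prod2, prod4
d17 = jaccard_distance(user1, user7)  # identical rows
```

Here user1 and user5 share one product out of three in their union, so their distance is 2/3, while user1 and user7 are identical and have distance 0.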
I tried to achieve this with the following approach:
from sklearn.neighbors import BallTree

# Build the tree on the full user x product matrix
ballt = BallTree(df, leaf_size=30, metric='jaccard')
# Query the 10 nearest neighbors of every user
distances, neighbors = ballt.query(df, k=10)
But my 60 GB of memory is not enough, and the Python kernel eventually crashes.
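For scale, here is a rough back-of-the-envelope estimate of the size of the matrix. It assumes the data ends up as a dense array with 8 bytes per cell (float64 or int64); I have not confirmed what BallTree does internally, so treat that as an assumption:

```python
# Shape reported by df.shape above.
n_rows, n_cols = 1664599, 3935
n_cells = n_rows * n_cols

# Assumption: a dense copy with 8 bytes per cell (float64 / int64).
bytes_8_per_cell = n_cells * 8
# For comparison: the same 0/1 data stored with 1 byte per cell (uint8 / bool).
bytes_1_per_cell = n_cells * 1

gb_8 = bytes_8_per_cell / 1e9
gb_1 = bytes_1_per_cell / 1e9
print(f"{gb_8:.1f} GB at 8 bytes/cell, {gb_1:.1f} GB at 1 byte/cell")
```

Under that assumption, a single dense 8-byte copy is already about 52 GB, before counting the tree itself or the query results.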
How can I perform this calculation without running out of memory?