I have some data after using ColumnTransformer(), like this:
>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
with 3553758 stored elements in Compressed Sparse Row format>
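For context, the preprocessing looks roughly like the sketch below; the column names and transformers are just placeholders (my real schema is different), but it produces the same kind of sparse matrix:
# Placeholder sketch of the preprocessing, not my real columns:
# one-hot encode the categoricals, scale the numerics, keep the result sparse.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df_raw = pd.DataFrame({
    'category': ['a', 'b', 'a', 'c'],  # placeholder categorical column
    'value': [1.0, 2.5, 3.1, 0.7],     # placeholder numeric column
})

ct = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['category']),
        ('scale', StandardScaler(), ['value']),
    ],
    sparse_threshold=1.0,  # return a sparse matrix as long as overall density is below 1.0
)
X_trans = ct.fit_transform(df_raw)  # sparse CSR matrix, like the real X_trans above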
I transform the data using TruncatedSVD(), which seems to work:
>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
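For reference, the reduced array is 197431 rows by 3 components:
>>> X_trans_svd.shape
(197431, 3)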
Now I want to apply DBSCAN to the transformed data:
>>> from sklearn.cluster import DBSCAN
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)
but my kernel crashes.
I also tried converting it to a DataFrame and passing that to DBSCAN:
>>> import pandas as pd
>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)
But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
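In case it helps with diagnosis, here is a minimal sketch of how I could watch the process's memory around the call (psutil is an assumption here, not something used in the code above):
# Sketch only, not my original code; assumes psutil is installed.
# dbscan and X_trans_svd are the objects defined above.
import psutil

proc = psutil.Process()
print('RSS before fit:', proc.memory_info().rss / 1e9, 'GB')
clusters = dbscan.fit_predict(X_trans_svd)
print('RSS after fit:', proc.memory_info().rss / 1e9, 'GB')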
EDIT: If I use just part of my 197431x3 array, it still works with X_trans_svd[0:170000] but starts crashing at X_trans_svd[0:180000]. Furthermore, the array itself is small:
>>> X_trans_svd.nbytes
4738344
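That figure matches the shape, i.e. 197431 rows times 3 float64 components:
>>> 197431 * 3 * 8   # rows * components * 8 bytes per float64
4738344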
EDIT2: Sorry for not providing this earlier. Here's an example to reproduce the crash; I tried it on two machines with 16 GB and 64 GB of RAM. The data is here: original data
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN

# load the data to cluster from the linked file
s = np.loadtxt('data.txt', dtype='float')

# time the clustering step
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)