I have some data after using ColumnTransformer(), like this:
>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
with 3553758 stored elements in Compressed Sparse Row format>
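For context, the preprocessing looks roughly like the sketch below; the column names and transformers are just placeholders (my real schema is different), but it produces the same kind of sparse matrix:
# Placeholder sketch of the preprocessing, not my real columns:
# one-hot encode the categoricals, scale the numerics, keep the result sparse.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df_raw = pd.DataFrame({
    'category': ['a', 'b', 'a', 'c'],  # placeholder categorical column
    'value': [1.0, 2.5, 3.1, 0.7],     # placeholder numeric column
})

ct = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['category']),
        ('scale', StandardScaler(), ['value']),
    ],
    sparse_threshold=1.0,  # return a sparse matrix as long as overall density is below 1.0
)
X_trans = ct.fit_transform(df_raw)  # sparse CSR matrix, like the real X_trans above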
I transform the data using TruncatedSVD(), which seems to work:
>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
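For reference, the reduced array is 197431 rows by 3 components:
>>> X_trans_svd.shape
(197431, 3)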
Now I want to apply DBSCAN to the transformed data:
>>> from sklearn.cluster import DBSCAN
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)
but my kernel crashes.
I also tried converting it to a DataFrame and passing that to DBSCAN:
>>> import pandas as pd
>>> d = {'1st_component': X_trans_svd[:, 0],
...      '2nd_component': X_trans_svd[:, 1],
...      '3rd_component': X_trans_svd[:, 2]}
>>> df = pd.DataFrame(data=d)
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)
But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
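In case it helps with diagnosis, here is a minimal sketch of how I could watch the process's memory around the call (psutil is an assumption here, not something used in the code above):
# Sketch only, not my original code; assumes psutil is installed.
# dbscan and X_trans_svd are the objects defined above.
import psutil

proc = psutil.Process()
print('RSS before fit:', proc.memory_info().rss / 1e9, 'GB')
clusters = dbscan.fit_predict(X_trans_svd)
print('RSS after fit:', proc.memory_info().rss / 1e9, 'GB')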
EDIT: If I use just part of my 197431x3 array, it still works with X_trans_svd[0:170000] but starts crashing at X_trans_svd[0:180000]. Furthermore, the array itself is small:
>>> X_trans_svd.nbytes
4738344
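That figure matches the shape, i.e. 197431 rows times 3 float64 components:
>>> 197431 * 3 * 8   # rows * components * 8 bytes per float64
4738344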
EDIT2: Sorry for not providing this earlier. Here's an example to reproduce the crash; I tried it on two machines with 16 GB and 64 GB of RAM. The data is here: original data
import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN

# load the data to cluster from the linked file
s = np.loadtxt('data.txt', dtype='float')

# time the clustering step
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)