
I have some data that came out of ColumnTransformer(), like:

>>> X_trans
<197431x6040 sparse matrix of type '<class 'numpy.float64'>'
with 3553758 stored elements in Compressed Sparse Row format>
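
For context, the preprocessing is roughly of this shape (a minimal sketch with placeholder column names, not my real pipeline; the one-hot encoding is what keeps the output sparse):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder columns, for illustration only; the real transformer is what
# produced the 197431x6040 sparse CSR matrix shown above.
ct = ColumnTransformer([
    ('num', StandardScaler(), ['some_numeric_col']),
    ('cat', OneHotEncoder(), ['some_categorical_col']),
])
X_trans = ct.fit_transform(df_raw)  # df_raw: the original DataFrame (assumed)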

I transform the data with TruncatedSVD(), which seems to work:

>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=3, random_state=0)
>>> X_trans_svd = svd.fit_transform(X_trans)
>>> X_trans_svd
array([[ 1.72326526,  1.85499833, -1.41848742],
       [ 1.67802434,  1.81705149, -1.25959756],
       [ 1.70251936,  1.82621935, -1.33124505],
       ...,
       [ 1.5607798 ,  0.07638707, -1.11972714],
       [ 1.56077981,  0.07638652, -1.11972728],
       [ 1.91659627, -0.12081577, -0.84551125]])
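
(As a quick sanity check on the reduction, the fitted estimator exposes how much variance the three components keep, though this shouldn't affect the crash:)

>>> svd.explained_variance_ratio_.sum()  # fraction of total variance kept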

Now I want to run DBSCAN on the transformed data:

>>> from sklearn.cluster import DBSCAN
>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(X_trans_svd)

but my kernel crashes.

I also tried converting it back to a DataFrame and passing that to DBSCAN:

>>> import pandas as pd
>>> d = {'1st_component': X_trans_svd[:, 0],
         '2nd_component': X_trans_svd[:, 1],
         '3rd_component': X_trans_svd[:, 2]}

>>> df = pd.DataFrame(data=d)

>>> dbscan = DBSCAN(eps=0.5, min_samples=5)
>>> clusters = dbscan.fit_predict(df)

But the kernel keeps crashing. Any idea why that is? I'd appreciate a hint.
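
One thing I can think of checking (a diagnostic sketch, assuming the crash is memory-related): sklearn's DBSCAN materializes the full eps-neighborhood of every point, so if most of the 197431 points lie within 0.5 of each other, memory grows towards n^2 index pairs even though the input array itself is tiny. Estimating the average neighborhood size on a sample:

>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>> rng = np.random.default_rng(0)
>>> sample = X_trans_svd[rng.choice(len(X_trans_svd), size=1000, replace=False)]
>>> nn = NearestNeighbors(radius=0.5).fit(X_trans_svd)
>>> sizes = [len(n) for n in nn.radius_neighbors(sample, return_distance=False)]
>>> np.mean(sizes)  # if this is huge, the neighborhoods alone exhaust RAM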

EDIT: If I use just part of my 197431x3 array, it works up to X_trans_svd[0:170000] and starts crashing at X_trans_svd[0:180000]. Furthermore, the size of the array is:

>>> X_trans_svd.nbytes
4738344
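
which matches 197431 rows x 3 columns x 8 bytes per float64 exactly, i.e. the array itself is under 5 MB:

>>> 197431 * 3 * 8  # rows * components * bytes per float64
4738344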

EDIT2: Sorry for not doing this earlier. Here's an example to reproduce the problem. I tried two machines with 16 GB and 64 GB of RAM. Data is here: original data

import numpy as np
from datetime import datetime
from sklearn.cluster import DBSCAN

# Load the exported SVD output (197431x3, see the data link above)
s = np.loadtxt('data.txt', dtype='float')

# Time the clustering step that crashes
elapsed = datetime.now()
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(s)
elapsed = datetime.now() - elapsed
print(elapsed)
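
For completeness, the sklearn DBSCAN docs suggest precomputing a sparse neighborhood graph to reduce memory; a sketch of that variant (untested on this data, and it will still blow up if nearly all points fall within eps of each other):

from sklearn.neighbors import NearestNeighbors

# Precompute the sparse eps-neighborhood graph, then cluster on it
nn = NearestNeighbors(radius=0.5, n_jobs=-1).fit(s)
graph = nn.radius_neighbors_graph(s, mode='distance')
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='precomputed')
clusters = dbscan.fit_predict(graph)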
  • It sounds like a memory problem. Try to run it from the console. It might work. – seralouk May 06 '21 at 09:39
  • Thanks, but unfortunately it does not make any difference. It's getting killed without further information. Anyway, X_trans_svd consists of 197431x3 values, that should not be too much, no? Any other idea? – AndreasInfo May 06 '21 at 11:14
  • can you share the full code and data? – seralouk May 06 '21 at 12:30
  • I updated the post to reproduce my problem. Thanks for your effort. – AndreasInfo May 06 '21 at 12:50
  • I can run the example without an issue: https://ibb.co/jVdXcZg. I use python3; it also worked with python2. – seralouk May 06 '21 at 12:59
  • Sorry, you probably already saw the typo. My example works as well, but with the original data it doesn't. What's a good platform to share it (probably by exporting the array to .txt)? – AndreasInfo May 06 '21 at 15:57
  • https://easyupload.io/ – seralouk May 06 '21 at 16:02
  • I really appreciate your help. I updated the question and I hope it breaks on your machine now as well :). – AndreasInfo May 06 '21 at 16:21
  • https://ibb.co/7pbtS4X It still works. I suggest restarting the kernel. If that does not resolve it, re-install the kernel. – seralouk May 06 '21 at 17:12
  • I am desperate. I tried reinstalling the kernel. I tried python2. I tried from the console. I tried two different machines. It works with dummy data in seconds, but with my real data it keeps crashing. I have absolutely no clue what else to do. – AndreasInfo May 07 '21 at 08:10
  • Ok, try something: save the code in a `test.py` file, open a terminal and run `python3 test.py`. This should work. – seralouk May 07 '21 at 08:35
  • Still crashing. All files are here: https://easyupload.io/m/l9cr3x Switching the data in line 41 to pca or svd results in the process getting killed. – AndreasInfo May 07 '21 at 08:55
