
I have a pandas DataFrame with 86k rows, 5 feature columns, and 1 target column. I'm trying to train a DecisionTreeClassifier using 70% of the DataFrame as training data, but I get a MemoryError from the fit method. I've tried changing some of the parameters, but I don't really know what's causing the error, so I don't know how to handle it. I'm on Windows 10 with 8GB of RAM.

Code

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

train, test = train_test_split(data, test_size=0.3)
X_train = train.iloc[:, 1:-1]  # first column is not a feature
y_train = train.iloc[:, -1]    # last column is the target
X_test = test.iloc[:, 1:-1]
y_test = test.iloc[:, -1]

DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)  # MemoryError is raised here
dt_predictions = DT.predict(X_test)

Error

File (...), line 97, in <module>
DT.fit(X_train, y_train)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
X_idx_sorted=X_idx_sorted)
File "(...)\AppData\Local\Programs\Python\Python36-32\lib\site-packages\sklearn\tree\tree.py", line 362, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn\trewe\_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn\tree\_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn\tree\_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn\tree\_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 671612928 bytes

The same error happens when I try RandomForestClassifier, always on the line that does the fitting. How can I solve this?

julia
  • Just to satisfy my curiosity, could you try `y_train = train.iloc[:, -1:]` (adding the colon to the end) so that your Y values are shape (n, 1) rather than just (n,). I don't think that's causing the issue, but I know I've seen sklearn warnings about that before – scnerd Jun 21 '18 at 18:44
  • If you open task manager before/during running this code, do you have a few GBs of memory to spare, or are other processes perhaps consuming it all? – scnerd Jun 21 '18 at 18:46
  • I did that to both y_train and y_test and it doesn't really change anything, I get the same error :/ – julia Jun 21 '18 at 18:49
  • I have about 2.7GB of memory available before running, during execution it goes down to 2.3GB minimum up until I get the MemoryError. – julia Jun 21 '18 at 18:52
  • 1
    2.7G might not be enough. That error alone indicates that one chunk of memory being allocated is ~670M, and is a `realloc` call, which means it might be trying to size up an existing allocated block of similar size... so possibly 1-1.5G for just that one thing it's allocating. You data seems very reasonably sized, but that decision tree might just need more memory. Try killing some memory-hogging processes and/or rebooting, then trying again. Chrome, for example, can take a huge amount of memory if you have a lot of tabs open. 8G should be enough for this problem, though. – scnerd Jun 21 '18 at 18:55
  • I killed every process I could and 3.7GB available still wasn't enough... I'll try it on another computer with more RAM then. Thank you – julia Jun 21 '18 at 19:02
  • I have the same issue with TfidfVectorizer. I tried changing ngram_range as mentioned in other solutions, but it doesn't work. I think it's a bug in the library; they should add code to handle memory efficiently. I'm only working on 1800 rows of sentences and have 12GB of RAM. – Morse Jun 21 '18 at 19:24
  • Can you show the output of `X_train.dtypes`, `X_train.shape`, `y_train.shape` and `numpy.unique(y_train)` just before you call `.fit()`? 86k rows and 5 features is nothing, so this should not take so much RAM. – Jon Nordby Jun 23 '18 at 19:17
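
For reference, a minimal diagnostic sketch along the lines of that last comment, reusing the variable names from the question (run just before the `.fit()` call):

import numpy as np

print(X_train.dtypes)      # per-column dtypes of the feature DataFrame
print(X_train.shape)       # roughly (60200, 5) for a 70% split of 86k rows
print(y_train.shape)
print(np.unique(y_train))  # many distinct float values would suggest a continuous target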

1 Answer


I've been running into the same issue. Make sure you're dealing with a classification problem and not a regression problem: a classifier treats every distinct target value as a separate class, so a continuous target can produce an enormous tree that exhausts memory. If your target column is continuous, you should use RandomForestRegressor (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) instead of RandomForestClassifier.
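
A minimal sketch of that check, reusing the variables from the question; treating the target as continuous is an assumption here, and DecisionTreeRegressor stands in as the regression counterpart:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A continuous target makes DecisionTreeClassifier treat every distinct
# value as its own class, which can blow the tree up in memory.
print(y_train.dtype, np.unique(y_train).size)  # many unique floats => regression

DT = DecisionTreeRegressor()  # regression counterpart of DecisionTreeClassifier
DT.fit(X_train, y_train)
dt_predictions = DT.predict(X_test)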

Teuszie