
I am following along with the tutorial here: https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/

I have the exact same code the author uses, but I will still share it below...

import scipy.io
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# X holds the images, y the digit labels
train_data = scipy.io.loadmat('train_32x32.mat')
X = train_data['X']
y = train_data['y']

img_index = 24

# Flatten each 32x32x3 image into a single row of 3072 pixel values
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=10, n_jobs=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)  # <----------- MemoryError raised here

preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))

The dataset I am using is basically a dictionary containing images of digits and their numeric labels. Every time I get to the line I pointed out above, I receive a MemoryError. The full error traceback is below:

Traceback (most recent call last):
  File "C:/Users/jack.walsh/Projects/img_recog/main.py", line 22, in <module>
    clf.fit(X_train, y_train)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\ensemble\forest.py", line 249, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\utils\validation.py", line 496, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

I ran Resource Monitor side-by-side with it and realized my used memory never goes above 30%. Let me know how I can get around this without altering the results!

X.shape = (73257, 3072)

X_train.shape = (51279, 3072)

I have 16GB RAM on this machine.

Jack Walsh

1 Answer


Given that your dataset has 3072 columns (reasonable for images), I think it is simply too much for a random forest, especially when no regularization is applied to the classifier. The machine just doesn't have enough memory to allocate for such a gigantic model.

Something that I would do in this situation:

  1. Reduce the number of features before training. This is hard to do since your data is images and each column is just a pixel value, but you could resize the images to be smaller (the sketch after this list shows one way).

  2. Add regularization to your random forest classifier, for example set max_depth to be smaller or set max_features so that not all 3072 features are considered at every split (see the sketch after this list). Here's the full list of parameters that you can tune: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

  3. According to Scikit Learn RandomForest Memory Error, setting `n_jobs=1` might help as well.

  4. Lastly, I personally would not use a random forest for image classification. I would choose classifiers like SVM or go deep with deep learning models.
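If it helps, here is a minimal sketch combining points 1 and 2: downscale the images before flattening, then fit a constrained forest. The use of scikit-image for resizing, the 16x16 target size, and the max_depth/max_features values are my own illustrative assumptions (not anything from the tutorial), so adjust them to your accuracy/memory trade-off.

# Sketch only: shrink each image (point 1) and constrain the forest (point 2).
# The 16x16 target size and the max_depth / max_features values are assumptions to tune.
import numpy as np
import scipy.io
from skimage.transform import resize          # scikit-image, used only for resizing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

train_data = scipy.io.loadmat('train_32x32.mat')
X = np.moveaxis(train_data['X'], -1, 0)       # (n_samples, 32, 32, 3)
y = train_data['y'].ravel()

# Downscale 32x32x3 -> 16x16x3; keep float32 to halve memory versus float64
X_small = np.stack([resize(img, (16, 16, 3), anti_aliasing=True) for img in X]).astype(np.float32)

# Flatten: 16*16*3 = 768 features per image instead of 3072
X_small = X_small.reshape(X_small.shape[0], -1)

X_train, X_test, y_train, y_test = train_test_split(X_small, y, test_size=0.3, random_state=42)

# Shallower trees + only sqrt(n_features) candidates per split
clf = RandomForestClassifier(n_estimators=10, max_depth=20, max_features='sqrt', n_jobs=1, random_state=42)
clf.fit(X_train, y_train)

Note that your traceback fails inside check_array, where fit makes a float32 copy of X, so fewer columns directly shrinks that copy.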

TYZ
  • I probably have to learn more about the different types of classifiers. Resizing the images would reduce the model's accuracy in the end, as they are already pretty low-quality images. I get the error even with `n_jobs=1`, `max_depth=1`, and `max_features=10`. Would a different classifier likely handle the memory issues better? – Jack Walsh Jun 06 '19 at 20:29
  • @JackWalsh If you are doing anything related to images, it's really not about the choice of classifier but about the image features themselves, since every pixel counts and an image has so many pixels. Even MNIST, the most popular dataset, has 784 features (28*28), and those are very small images. In my opinion, it doesn't make much sense to do image classification with anything other than neural networks, unless you do it the traditional way and extract important features from the images first, e.g. HOG. – TYZ Nov 11 '19 at 19:28
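For anyone curious about the "traditional" route mentioned in that last comment, here is a rough sketch: extract one HOG descriptor per image and train a linear SVM on those descriptors. The HOG parameters and the choice of LinearSVC are illustrative assumptions, not something from this thread.

# Sketch only: HOG features + linear SVM as the "traditional" alternative mentioned above.
# The HOG parameters (orientations, cell/block sizes) are illustrative choices.
import numpy as np
import scipy.io
from skimage.color import rgb2gray
from skimage.feature import hog
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

train_data = scipy.io.loadmat('train_32x32.mat')
X = np.moveaxis(train_data['X'], -1, 0)   # (n_samples, 32, 32, 3)
y = train_data['y'].ravel()

# One compact HOG descriptor per image (324 values here, versus 3072 raw pixels)
features = np.array([hog(rgb2gray(img), orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for img in X], dtype=np.float32)

X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.3, random_state=42)

clf = LinearSVC()
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))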