
I am struggling to use Random Forest in Python with scikit-learn. My problem is that I use it for text classification (3 classes: positive/negative/neutral), and the features I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators = 100)
trainFeatures1 = vec.fit_transform(trainFeatures)

# Fit the training data to the training output and create the decision trees
rf = rf.fit(trainFeatures1.toarray(), LabelEncoder().fit_transform(trainLabels))

testFeatures1 = vec.fit_transform(testFeatures)
# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1.toarray(), LabelEncoder().fit_transform(testLabels))

print "accuracy: " + str(Output)

My problem is that fit_transform works on the training set, which contains around 8,000 instances, but when I try to convert my test set, which has around 80,000 instances, to numerical features too, I get a MemoryError:

testFeatures1 = vec.fit_transform(testFeatures)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
return self.transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError
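(For scale: the dense array the traceback tries to allocate is len(X) × len(vocab) float64 values. The vocabulary size below is a hypothetical figure just to illustrate why this blows up.)

```python
rows = 80000                      # test instances
vocab = 100000                    # hypothetical unigram vocabulary size
dense_bytes = rows * vocab * 8    # float64 = 8 bytes each

# dense_bytes == 64000000000, i.e. roughly 60 GiB -- far more than typical RAM
```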

What could possibly cause this and is there any workaround? Many thanks!

Crista23
  • Can you try using sparse features? I don't think the toarray() calls should be needed. – Rob Neuhaus Feb 24 '14 at 21:55
  • scikit-learn's RandomForestClassifier doesn't take sparse matrices as input. One solution is to split your test set into batches of a certain size, then run predict on each of the smaller batches. – Matt Feb 24 '14 at 23:01
  • @rrenaud I also tried this by creating the vec object as vec = DictVectorizer(). It still didn't help. – Crista23 Feb 24 '14 at 23:03
  • @Matt Indeed, that's why I used sparse=False. – Crista23 Feb 24 '14 at 23:04
  • Another solution is to use `TfidfVectorizer` followed by a `TruncatedSVD` to reduce the dimensionality of the feature space. – Matt Feb 24 '14 at 23:17
  • You don't need the `LabelEncoder`. The `y` may contain strings. – Fred Foo Feb 25 '14 at 10:34

1 Answer


You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training.
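A minimal sketch of the intended pattern with DictVectorizer (the toy feature dicts are made up for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)

train_feats = [{"good": 1}, {"bad": 1}]    # hypothetical unigram features
test_feats = [{"good": 1, "new": 1}]       # "new" never occurred in training

X_train = vec.fit_transform(train_feats)   # learns the vocabulary
X_test = vec.transform(test_feats)         # reuses it; unseen words are dropped

assert X_train.shape[1] == X_test.shape[1]  # columns line up between train/test
```

Unseen test-time words are silently ignored by transform, which is what you want here: the forest only knows about the training vocabulary anyway.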

For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams etc.).
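For instance (a sketch along the lines Matt suggested in the comments; the toy corpus and the min_df / n_components values are arbitrary choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["good movie", "bad movie", "good plot bad acting"]  # toy corpus

# min_df=2 drops unigrams that appear in fewer than 2 documents;
# max_features would cap the vocabulary size outright
vect = TfidfVectorizer(min_df=2)
X = vect.fit_transform(docs)       # sparse matrix with a pruned vocabulary

# Optionally compress further with truncated SVD (LSA)
svd = TruncatedSVD(n_components=2)
X_dense = svd.fit_transform(X)     # small dense matrix, safe for a RandomForest
```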

UPDATE

If the only problem is transforming the test data, simply split it into small chunks. Instead of something like

x=vect.transform(test)
eval(x)

you can do

K = 10
size = len(test) // K  # chunk size (integer division)
for i in range(K):
    x = vect.transform(test[i*size : (i+1)*size])
    eval(x)

and record results/stats and analyze them afterwards.

In particular:

from sklearn.metrics import accuracy_score

predictions = []

K = 10
size = len(test) // K  # chunk size (integer division)
for i in range(K):
    x = vect.transform(test[i*size : (i+1)*size])
    predictions += list(rf.predict(x))  # predict returns an array; convert it to a list

print accuracy_score(true_labels, predictions)
lejlot