Accuracy with TF-IDF and non-TF-IDF features

Question

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.

In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.

Around 500 of them are the non-TF-IDF features.

The issue is that the accuracy of the Random Forest on the same test set etc with

- only the non-TF-IDF features is 87%

- the TF-IDF and non-TF-IDF features is 76%

This significant drop in accuracy raises some questions in my mind.

The relevant piece of code of mine with the training of the models is the following:

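import os

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

drop_columns = ['labels', 'complete_text_1', 'complete_text_2']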

# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values


# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])

vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])


# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)


# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])


# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)

Personally, I have not seen any bug in my code (this piece above and in general).

The hypothesis which I have formulated to explain this decrease in accuracy is the following.

- The number of non-TF-IDF features is only 500 (out of the 130k features in total).

- This makes it likely that the non-TF-IDF features are rarely picked at each split by the trees of the random forest (e.g. because of max_features).

- So if the non-TF-IDF features do actually matter, this will create problems because they are not sufficiently taken into account.
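As a rough check of this hypothesis, the arithmetic can be sketched as follows. It assumes the default `max_features="sqrt"` and that the candidates at each split are a uniform sample without replacement from all 130k features; it ignores implementation details such as how scikit-learn handles features that are constant within a node. Under these assumptions a split sees roughly 1.4 non-TF-IDF candidates on average, and about a quarter of splits see none at all.

```python
from math import comb, isqrt

n_total = 130_000   # total number of features, as stated in the question
n_dense = 500       # non-TF-IDF features, as stated in the question
k = isqrt(n_total)  # candidates per split, assuming max_features="sqrt"

# Expected number of non-TF-IDF features among the k candidates at one split
expected_dense = k * n_dense / n_total

# Probability that none of the k candidates is a non-TF-IDF feature
# (hypergeometric: all k candidates drawn from the TF-IDF features only)
p_none = comb(n_total - n_dense, k) / comb(n_total, k)

print(f"candidates per split: {k}")
print(f"expected non-TF-IDF candidates per split: {expected_dense:.2f}")
print(f"P(no non-TF-IDF candidate at a split): {p_none:.3f}")
```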

Related to this, when I check the feature importances of the random forest after training, I see that the importances of the non-TF-IDF features are very low (although I am not sure how reliable feature importances are as an indicator, especially with TF-IDF features included).

Is there a different explanation for the decrease in my classifier's accuracy?

In any case, what would you suggest doing?

Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.

One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features. Then the results of these two models will be combined either by (weighted) voting or meta-classification.
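A minimal sketch of the weighted-voting variant of this idea, on small synthetic data. The arrays `X_dense` and `X_tfidf`, the label rule, and the weight `w_dense` are all hypothetical stand-ins, not taken from the question; in practice the weight would have to be tuned on a validation set.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic stand-ins (assumption): a few dense "general" features
# and a sparse block that plays the role of the TF-IDF matrix.
X_dense = rng.normal(size=(n_samples, 5))
X_tfidf = sparse_random(n_samples, 50, density=0.1, format="csr", random_state=0)
y = (X_dense[:, 0] > 0).astype(int)

train, test = np.arange(150), np.arange(150, n_samples)

# One forest per feature group
rf_dense = RandomForestClassifier(n_estimators=50, random_state=0)
rf_dense.fit(X_dense[train], y[train])
rf_tfidf = RandomForestClassifier(n_estimators=50, random_state=0)
rf_tfidf.fit(X_tfidf[train], y[train])

# Weighted soft voting: blend the class probabilities of the two models
w_dense = 0.7  # hypothetical weight for the non-TF-IDF model
proba = (w_dense * rf_dense.predict_proba(X_dense[test])
         + (1 - w_dense) * rf_tfidf.predict_proba(X_tfidf[test]))
y_pred = rf_dense.classes_[proba.argmax(axis=1)]

print("blended accuracy:", (y_pred == y[test]).mean())
```

For the meta-classification variant, the two probability outputs would instead be fed as features to a second-stage model, ideally using out-of-fold predictions so the meta-model does not see probabilities produced on the same rows the forests were trained on.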

Would be really helpful here to provide more information about your task, such as: what is the task, loss function, number of datapoints you use, etc. — Alexander Pivovarov, Jun 11 '20 at 05:18
I mean from the code it seems that you have a binary classification task on text documents with cross-entropy loss, but spelling it explicitly would make it a bit simpler to read your question. And the number of datapoints is extremely important too of course (don't see that mentioned anywhere). — Alexander Pivovarov, Jun 11 '20 at 05:27
@AlexanderPivovarov, good points. It is a bit too time consuming to provide all this information here and this is why I have not thus far. For start, the observations of the training set are around 120k in number. — Outcast, Jun 11 '20 at 10:51
If you look at the documentation for `max_features`, it will use `sqrt(n_features)` by default, which is about 360 features any given tree will see. Even if there's no overlap in those features between different trees, 150*360 = 54k. So most of your 130k features will _never be seen_ by the model. — Swier, Jun 17 '20 at 12:25
@Swier, sure I agree about `max_features` in general and this is why I refer to this at my post too. However, unless I am missing something, keep in mind that a new set of features is chosen every time **at each split** and **not only at each tree** based on the original paper of the random forest but also based on the SkLearn documentation for the `RandomForestClassifier` (`max_features{“auto”, “sqrt”, “log2”}, int or float, default=”auto” The number of features to consider when looking for the best split:`). — Outcast, Jun 17 '20 at 12:47
@Outcast It looks like you're right, I misinterpreted the SKLearn documentation. The documentation suggests that a new sample of features is considered for each split, and indeed, it seems to be [implemented](https://github.com/scikit-learn/scikit-learn/blob/fd237278e895b42abe8d8d09105cbb82dc2cbba7/sklearn/tree/_splitter.pyx#L334) like that as well. Thanks for pointing it out to me! — Swier, Jun 17 '20 at 14:22
Sure @Swier. This means that quite a lot more features are taken into account but again probably quite some are left out in the end or not taken enough into account. ;) — Outcast, Jun 17 '20 at 14:45
I'd suggest experimenting with different parameters. What happens if you use, say, 300 trees? Or change the max_features to a larger number? I wouldn't be surprised if one of these gave a higher accuracy. — Timothy Smith, Jun 17 '20 at 15:30
@TimothySmith, but I am also open to the scenario that it is not about the hypothesis which I stated in my post - however, it would be good to come up with a new one then ;) — Outcast, Jun 17 '20 at 17:20

score 2 · Accepted Answer · answered Jun 11 '20 at 05:15


Your view that 130K features is far too many for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.

If the number of data points is large enough, you may want to train some transformation for the TF-IDF features. For example, you could train low-dimensional embeddings of these TF-IDF features into, say, a 64-dimensional space, with a small NN on top of that (maybe even a linear model). Once you have the embeddings, you could use them as transforms to generate 64 additional features for each example, replacing the TF-IDF features for random forest training. Alternatively, replace the whole random forest with a NN whose architecture combines all the TF-IDF features into a few neurons via fully connected layers and later concatenates them with the other features (much the same as embeddings, but as part of the NN).
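As a simple stand-in for a learned embedding, the sketch below compresses the sparse block to 64 dimensions with `TruncatedSVD` (latent semantic analysis) and feeds the result to the forest together with the dense features. This is not the NN-trained embedding described above, only an unsupervised substitute that shows the plumbing; the data and variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic stand-ins (assumption) for the general features and the TF-IDF matrix
X_dense = rng.normal(size=(n_samples, 5))
X_tfidf = sparse_random(n_samples, 1000, density=0.01, format="csr", random_state=0)
y = (X_dense[:, 0] > 0).astype(int)

# Project the high-dimensional sparse block onto 64 dense components
svd = TruncatedSVD(n_components=64, random_state=0)
X_tfidf_64 = svd.fit_transform(X_tfidf)

# The forest now sees 5 + 64 features instead of 5 + 1000
X_all = np.hstack([X_dense, X_tfidf_64])
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y)

print(X_all.shape)
```

At prediction time the fitted `svd` would be applied with `svd.transform` to the TF-IDF matrix of the new documents before stacking.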

If you don't have enough data to train a large NN, you could try training a GBDT ensemble instead of a random forest. It should probably do a much better job of picking the good features than a random forest, which is very likely to be hurt by a large number of noisy, useless features. You could also train a crude version first and then do feature selection based on it (again, I would expect it to do a more reasonable job than a random forest).
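The "crude model first, then feature selection" idea might look like the sketch below. It assumes scikit-learn's `GradientBoostingClassifier` as the GBDT and keeps every non-TF-IDF column plus the `M` TF-IDF columns with the highest importance; the data, the column layout (dense columns first), and the value of `M` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_samples, n_dense, n_tfidf = 300, 5, 200

# Synthetic stand-ins (assumption): dense columns first, TF-IDF-like columns after
X_dense = rng.normal(size=(n_samples, n_dense))
X_tfidf = sparse_random(n_samples, n_tfidf, density=0.05, format="csr", random_state=0)
X_all = hstack([csr_matrix(X_dense), X_tfidf]).tocsr()
y = (X_dense[:, 0] > 0).astype(int)

# Step 1: a small, shallow boosted ensemble used only to rank features
crude = GradientBoostingClassifier(n_estimators=30, max_depth=2, random_state=0)
crude.fit(X_all, y)

# Step 2: keep all dense columns and the M most important TF-IDF columns
M = 20  # hypothetical number of TF-IDF features to keep
tfidf_importance = crude.feature_importances_[n_dense:]
top_tfidf = np.argsort(tfidf_importance)[::-1][:M] + n_dense
keep = np.concatenate([np.arange(n_dense), np.sort(top_tfidf)])
X_pruned = X_all[:, keep]

print(X_pruned.shape)
```

The final model would then be trained on `X_pruned`; the same `keep` index has to be applied to the test matrix.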

Alexander Pivovarov


Thank you for your answer :) Just to clarify for a start, I did not exactly say "130K of features is way too much for the Random forest". If I had 130k features that were all TF-IDF features, then I think it would actually be fine. The problem, I think, is rather that out of the 130k features, 500 of them (the non-TF-IDF ones) are probably pretty important (as shown when I use only them). – Outcast Jun 11 '20 at 12:51
Ok, maybe my wording is a bit too strong here. But random forests (and other tree ensemble algorithms like GBDT) are difficult to use with sparse features like bag-of-words (and TF-IDF is just a nice way to set the weights for bag-of-words features here). Of course, multiple tweaks could be imagined to help the random forest do better in your example. For instance, if you could weight your features for random forest training, that would likely help avoid the accuracy loss, but sklearn's random forest doesn't support such a thing as far as I know. – Alexander Pivovarov Jun 11 '20 at 14:35
If you really want something very simple to help RandomForest training here I would suggest to do 2 step training: (first step - feature selection) train a much larger forest using all features you have, then get any sort of feature importance output from that training (sklearn's feature importance output will probably be ok) and prune only TF-IDF features based on that (e.g. use all your non-TF-IDF features + best `M` TF-IDF features based on feature importance), then (second step) train a regular size ensemble using pruned feature set. – Alexander Pivovarov Jun 11 '20 at 14:38
Sure, this option is certainly valid, but this is actually what I have already done: the 130k features are what remains after feature selection (based on feature importances), and specifically they are about the top 1% of all the TF-IDF features. I could be even more "selective", I suppose, but I am not sure that it is necessarily the best option for accuracy. Another option is the one I describe in the last paragraph of my post. – Outcast Jun 11 '20 at 16:01
Lastly, "But random forests (and other tree ensemble algorithms like GBDT) are difficult to use with sparse features like bag-of-words", do you have some sources for this claim of yours? I have not really heard it much although there are some cases (eg https://stats.stackexchange.com/a/228786/193309) but there are also the opposite claims too (eg https://stats.stackexchange.com/a/47467/193309). Also, I have tested in other cases myself the random forest on TF-IDF high dimensional data (without non-TF-IDF) and it went very well and better than other models generally speaking. – Outcast Jun 11 '20 at 16:04
In this regard, if not random forest then which? :) For example, SVM or XGBoost? I will be surprised if they will make such a difference but I may be wrong. ;) – Outcast Jun 11 '20 at 16:13
I don't have any sources backing my words, just speaking of experience and common sense. Imagine you have 150 trees (taking that from your code sample) and, say, even 100 splits in one tree. That gives you 15K splits in total in the whole ensemble. There is no way 130K features can be utilized fully in such cases. And if each feature of 130K carries information, then some of it will inevitably be lost. Even more difficult to learn any interactions between them. – Alexander Pivovarov Jun 11 '20 at 16:14
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215757/discussion-between-alexander-pivovarov-and-outcast). – Alexander Pivovarov Jun 11 '20 at 16:15
In general RandomForests and tree based approaches are contraindicated for what looks to be a text mining/analysis sort of problem. – DejaVuSansMono Jun 11 '20 at 22:53
Yeah, I agree with that. Embeddings with sparse optimizer sounds like a more appropriate choice to me. – Alexander Pivovarov Jun 11 '20 at 23:00

score 0 · Answer 2 · answered Jun 17 '20 at 17:56

My guess is that your hypothesis is partly correct.

When using the full dataset (in the 130K feature model), each split in the tree uses only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, then each split misses out on a lot of useful data. The data that is ignored for one split will probably be used for a different split in the tree, but the result isn't as good as it would be when more of the data is used at every split.

I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.

In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.

In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.

If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
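A small experiment of this kind could be set up as in the sketch below, which cross-validates a few `(n_estimators, max_features)` combinations. The data is synthetic and the particular settings are hypothetical; the sketch shows how to run the comparison, not what the outcome will be on the real data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic stand-ins (assumption) for the combined feature matrix
X_dense = rng.normal(size=(n_samples, 5))
X_tfidf = sparse_random(n_samples, 500, density=0.02, format="csr", random_state=0)
X_all = hstack([csr_matrix(X_dense), X_tfidf]).tocsr()
y = (X_dense[:, 0] > 0).astype(int)

# Hypothetical settings: the default-like baseline, more candidates per split, more trees
settings = [(50, "sqrt"), (50, 0.3), (100, "sqrt")]

scores = {}
for n_estimators, max_features in settings:
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,  # a float is read as a fraction of all features
        random_state=0,
    )
    cv_scores = cross_val_score(rf, X_all, y, cv=3, scoring="accuracy")
    scores[(n_estimators, max_features)] = cv_scores.mean()

for setting, score in scores.items():
    print(setting, round(score, 3))
```

Larger `max_features` values make each split more expensive, so on 130k features the training time should be watched along with the accuracy.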

I may agree almost entirely with this `The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF).` but the issue for which I try to find an answer at my post is this: `The issue is that the accuracy of the Random Forest on the same test set etc with only the non-TF-IDF features is 87% and with the TF-IDF and non-TF-IDF features is 76%`, hence my answer on the non-TF-IDF features. — Outcast, Jun 17 '20 at 18:34
