If you don't want to change the architecture of your neural network and are only trying to reduce the memory footprint, one tweak is to reduce the number of terms extracted by the CountVectorizer.
From the scikit-learn documentation, there are (at least) three parameters for reducing the vocabulary size:
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
max_features : int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
As a first step, try playing with max_df and min_df. If the size still doesn't meet your requirements, you can cap the vocabulary at whatever size you like with max_features (see the sketches below).
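Here is a minimal sketch of how max_df and min_df prune the vocabulary. The corpus and the threshold values are made-up assumptions, chosen only to make the effect visible on a tiny example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, chosen only to make the pruning visible.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird flew over the log",
]

# Default settings: every term passing the default tokenizer ends up
# in the vocabulary ("a" is dropped by the default token pattern).
vec_full = CountVectorizer()
vec_full.fit(corpus)
print(len(vec_full.vocabulary_))  # 11 terms

# max_df=0.75 drops terms appearing in more than 75% of documents
# ("the" appears in all four); min_df=2 drops terms appearing in
# fewer than two documents ("mat", "chased", "bird", "flew", "over").
vec_pruned = CountVectorizer(max_df=0.75, min_df=2)
vec_pruned.fit(corpus)
print(len(vec_pruned.vocabulary_))  # 5 terms: cat, dog, log, on, sat
```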
NOTE:
Tuning max_features can reduce your classification accuracy more than tuning the other parameters, since it discards terms purely by corpus frequency.
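Because of that trade-off, it may help to sweep a few candidate caps and check the resulting dimensionality before retraining. This is a sketch reusing the hypothetical corpus above; the cap values are arbitrary, and in practice you would compare validation accuracy at each setting:

```python
from sklearn.feature_extraction.text import CountVectorizer

# `docs` stands in for your real training corpus (list of strings).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird flew over the log",
]

# Sweep a few candidate caps and inspect the resulting feature count.
# In practice, re-train the network and compare validation accuracy
# at each setting before committing to a value.
for cap in (None, 8, 4):
    X = CountVectorizer(max_features=cap).fit_transform(docs)
    print(f"max_features={cap}: {X.shape[1]} features")
```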