
I am working on two-class text classification. Usually I create pickle files of the trained model and load them at the start of the training phase to skip retraining.

With 12,000 reviews plus more than 50,000 tweets for each class, the trained model grows to 1.4 GB.

Storing this much model data in a pickle and loading it back is neither feasible nor advisable.

Is there a better alternative for this scenario?

Here is some sample code. I tried multiple ways of pickling; here I have used the dill package:

    def train(self):
        global pos, neg, totals
        retrain = False

        # Load counts if they already exist (pickle/dill files should be opened in binary mode).
        if not retrain and os.path.isfile(CDATA_FILE):
            # pos, neg, totals = cPickle.load(open(CDATA_FILE))
            with open(CDATA_FILE, 'rb') as f:
                pos, neg, totals = dill.load(f)
            return

        # Count features from the "unsuspected" class; 'not_' marks the negated form.
        for file in os.listdir("./unsuspected/"):
            for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
                neg[word] += 1
                pos['not_' + word] += 1

        # Count features from the "suspected" class.
        for file in os.listdir("./suspected/"):
            for word in set(self.negate_sequence(open("./suspected/" + file).read())):
                pos[word] += 1
                neg['not_' + word] += 1

        self.prune_features()

        totals[0] = sum(pos.values())
        totals[1] = sum(neg.values())

        # Serialize the counts so later runs can skip retraining.
        countdata = (pos, neg, totals)
        with open(CDATA_FILE, 'wb') as f:
            dill.dump(countdata, f)

UPDATE: The reason the pickle is so large is that the classification data itself is very large, and I use 1-4 grams for feature selection. The classification dataset alone is around 300 MB, so the multigram approach to feature selection produces a large training model.

user123
  • Not at all familiar with `dill`, but have you looked into the pickle it creates? I am guessing that you could identify things which don't need to be pickled, and create a better serialization of your own. More work, obviously, but maybe at least update the question with observations about the reason the pickle is so big ...? – tripleee Mar 17 '16 at 07:08
  • @tripleee: the reason behind the large pickle is that the classification data is very large, and I have considered 1-4 grams for feature selection. – user123 Mar 17 '16 at 07:10

2 Answers


Pickle is a very heavy format: it stores all the details of the objects. It would be much better to store your data in an efficient format like HDF5. If you are not familiar with HDF5, you can look into storing your data in simple flat text files instead. You can use CSV or JSON, depending on your data structure; you'll find either more efficient than pickle.
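
For example, a minimal JSON-based sketch, assuming pos and neg are plain {word: count} dicts and totals is a two-element list (as in the question's train()); save_counts/load_counts are just illustrative helper names:

    import json

    def save_counts(path, pos, neg, totals):
        # Plain dicts of str -> int and a small list serialize directly to JSON.
        with open(path, 'w') as f:
            json.dump({'pos': pos, 'neg': neg, 'totals': totals}, f)

    def load_counts(path):
        with open(path) as f:
            data = json.load(f)
        # json.load returns plain dicts; wrap them in defaultdict(int) if your code relies on that.
        return data['pos'], data['neg'], data['totals']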

You can look at gzip to create and load compressed archives.
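
A minimal sketch of gzip-compressing the same (pos, neg, totals) tuple the question pickles (the file name is just an example):

    import gzip
    import dill

    # Write the pickle through a gzip stream; repetitive n-gram keys compress well.
    with gzip.open('countdata.pkl.gz', 'wb') as f:
        dill.dump((pos, neg, totals), f)

    # Read it back through the same kind of stream.
    with gzip.open('countdata.pkl.gz', 'rb') as f:
        pos, neg, totals = dill.load(f)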

DevShark
  • I am storing an object into the pickle. Flat text files will not be able to store the object, and a single object cannot be split across multiple files, so there should be a single file that contains the entire object data. hdf5 I still have to check. – user123 Mar 17 '16 at 07:15
  • If there is no other way to serialize your models, your setup is highly unusual. Most machine learning systems produce some sort of dictionary, mapping a set of input features to a set of outcomes. This could be a list of weighted edges or something like that. From your description, I would expect your results to be articulated as maybe a dictionary mapping input tokens to numeric identifiers, and another mapping tuples of these to a set of category weights. – tripleee Mar 17 '16 at 08:03

The problem and solution are explained here. In short, when doing featurization, e.g. with CountVectorizer, even though you might ask for a small number of features (e.g. max_features=1000), the transformer still keeps a copy of all possible features under the hood for debugging purposes. For instance, CountVectorizer has the following attribute:

    stop_words_ : set
        Terms that were ignored because they either:
        - occurred in too many documents (max_df)
        - occurred in too few documents (min_df)
        - were cut off by feature selection (max_features).
        This is only available if no vocabulary was given.

and this causes the model size to become very large. To solve this issue, you can set stop_words_ to None before pickling your model (the snippet below is taken from the linked example; please check the link above for details):

    import pickle

    model_name = 'clickbait-model-sm.pkl'

    # Drop the bookkeeping attribute that bloats the vectorizer, then pickle as usual.
    cfr_pipeline.named_steps.vectorizer.stop_words_ = None
    pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)
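
As a rough sketch (assuming cfr_pipeline is an ordinary scikit-learn Pipeline whose first step is the vectorizer), the slimmed-down pickle loads and predicts as before, since stop_words_ is only kept for introspection:

    # Load the smaller pickle and use the pipeline as usual;
    # clearing stop_words_ does not affect transform/predict.
    loaded_pipeline = pickle.load(open(model_name, 'rb'))
    predictions = loaded_pipeline.predict(["example headline to classify"])
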
HaMi