I am doing a project in Machine Learning and for that I am using the pickle module of Python.

Basically, I am parsing a huge data set, which is not possible in one execution, so I need to save the classifier object and update it in the next execution.

So my question is: when I run the program again with a new data set, will the already created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?

save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()
arqam
  • Nothing happens automatically. Your next run of the program needs to open the file, load the pickle in it back into a normal Python object, modify that object, then save it back out just like you did above. – Kirk Strauser Apr 22 '16 at 14:23
  • @KirkStrauser That's what I am asking. Should I leave my code like this for the next run? Will the already created naivebayes.pickle get updated? – arqam Apr 22 '16 at 14:26
  • Does the classifier fit into RAM without impacting the rest of your calculations? – sobek Apr 22 '16 at 14:27
  • @Arqam There's nothing at all "special" about a file holding a pickle. It's just a regular file. If you update `classifier` and then run your code above again, `naivebayes.pickle` will then hold the new version. But that won't happen on its own: until you run the `pickle.dump` line, none of your modifications to `classifier` will be written out to `naivebayes.pickle`. – Kirk Strauser Apr 22 '16 at 14:33
  • @sobek I am not processing the complete data set at once, which is why it fits. I am splitting the data set and then updating my classifier object by training on each subset. – arqam Apr 22 '16 at 14:33
  • Yes, but using pickle might be quite inefficient. You could use some form of caching db that lives in RAM. Might be a lot more performant. – sobek Apr 22 '16 at 14:37
  • @sobek But isn't the basic idea of Machine Learning that we create a well-trained classifier that can be used anywhere? The only way I see to save the classifier object and update it is by using pickle. If you know of anything else, please tell me. – arqam Apr 22 '16 at 14:40

1 Answer

Unpickling your classifier object will re-create it in the same state it was in when you pickled it, so you can proceed to update it with fresh data from your data set. At the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea not to overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.

You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
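As one possible experiment of that kind, here is a minimal sketch that uses a `Counter` as a stand-in for the classifier. The object, the file name `counts.pickle`, and the temporary directory are all made up for illustration. It shows that changing the loaded object does not touch the file until you call `pickle.dump` again.

```python
import os
import pickle
import tempfile
from collections import Counter

# A stand-in for the classifier: any picklable object behaves the same way.
counts = Counter({"spam": 3, "ham": 5})

# Hypothetical file in a scratch directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), "counts.pickle")

# First "run": save the object.
with open(path, "wb") as f:
    pickle.dump(counts, f)

# Later "run": load it back and update it in memory.
with open(path, "rb") as f:
    restored = pickle.load(f)
restored.update({"spam": 2})  # changes only the in-memory object

# The file still holds the old state at this point.
with open(path, "rb") as f:
    before_save = pickle.load(f)

# Writing the object out again is what updates the file.
with open(path, "wb") as f:
    pickle.dump(restored, f)
with open(path, "rb") as f:
    after_save = pickle.load(f)

print(before_save["spam"], after_save["spam"])
```

The two printed numbers differ: the first is the value read before the second `pickle.dump`, the second is the value read after it.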


Here's a rough sketch of how to update the pickled classifier data.

import pickle
import os
from os.path import exists
# other imports required for nltk ...

picklename = "naivebayes.pickle"

# stuff to set up featuresets ...

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
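    # Caution: this assumes train() updates the classifier in place.
    # NLTK's NaiveBayesClassifier.train is a classmethod that returns a
    # new classifier, so check how your classifier's train method behaves.
    classifier.train(training_set)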
else:
    # Create a brand new classifier    
    classifier = nltk.NaiveBayesClassifier.train(training_set)

# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)

# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)

The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and apply more training data to it.
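The sketch above keeps a single `.bak` copy. If you would rather keep a series of backups, something along the following lines might work. The helper names `latest_pickle` and `next_pickle_name` are hypothetical, the three-digit numbering is just one possible convention (it stops working after 999 files), and a plain list stands in for the classifier so the example runs without nltk.

```python
import glob
import os
import pickle
import tempfile


def latest_pickle(directory, prefix="naivebayes"):
    """Return the path of the highest-numbered pickle, or None if there is none."""
    pattern = os.path.join(glob.escape(directory), prefix + "[0-9][0-9][0-9].pickle")
    paths = sorted(glob.glob(pattern))  # zero padding makes name order = numeric order
    return paths[-1] if paths else None


def next_pickle_name(directory, prefix="naivebayes"):
    """Return the file name to use for the next save in the series."""
    latest = latest_pickle(directory, prefix)
    if latest is None:
        number = 0
    else:
        digits = os.path.basename(latest)[len(prefix):-len(".pickle")]
        number = int(digits) + 1
    return os.path.join(directory, "%s%03d.pickle" % (prefix, number))


# Simulate three program runs in a scratch directory.
workdir = tempfile.mkdtemp()
for batch in (["a"], ["b"], ["c"]):
    latest = latest_pickle(workdir)
    if latest is None:
        state = []  # stand-in for creating a brand new classifier
    else:
        with open(latest, "rb") as f:
            state = pickle.load(f)
    state.extend(batch)  # stand-in for training on new data
    with open(next_pickle_name(workdir), "wb") as f:
        pickle.dump(state, f)

print(sorted(os.listdir(workdir)))
```

Normally each run loads the highest-numbered file. To go back to a known good state, you would load an earlier file by name instead of calling `latest_pickle`.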


BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing

import pickle 

with

import cPickle as pickle
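
If the same script has to run under both Python 2 and Python 3, a common idiom is to try `cPickle` first and fall back to `pickle`. This is only a sketch; the dictionary is made-up data used to check the round trip.

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: uses the C accelerator automatically when available

# Made-up data for a quick round-trip check.
data = {"label": "pos", "count": 3}
roundtrip = pickle.loads(pickle.dumps(data))
print(roundtrip == data)
```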
PM 2Ring
  • So the previously saved classifier object will not be removed when I run `pickle.dump` on the same object; instead it will be modified, right? – arqam Apr 22 '16 at 14:35
  • @Arqam: Depends on whether you specify a different filename when you save the classifier. Use the same name every time and it will effectively remove the previous version when it rewrites the file. – martineau Apr 22 '16 at 14:46
  • When you do `pickle.dump(classifier,save_classifier)` it will save the pickled representation of the current `classifier` object to the open file `save_classifier`, overwriting the old contents if that file already exists. And that's why I suggested to save your pickled data in a series of files. Eg `naivebayes000.pickle`, `naivebayes001.pickle`, etc. – PM 2Ring Apr 22 '16 at 14:47
  • @PM2Ring But ultimately I will have to use the one fully trained classifier object, saved in one file, for classification? And is what martineau said correct? – arqam Apr 22 '16 at 14:50
  • @PM2Ring I can only use one pickled classifier at a time for training, right? So how will having so many files be helpful? – arqam Apr 22 '16 at 14:53
  • @Arqam: Yes, martineau is correct. You don't _need_ to save a series of files, but it is a safer way to work. If you make a mistake with one of your training sessions and supply bad data then your classifier will be mis-trained. But if you keep a series of pickles, then you can just go back to an earlier version and continue training, rather than having to start again from scratch. – PM 2Ring Apr 22 '16 at 14:54
  • @PM2Ring But if I have so many files, I will be able to load only one using `pickle.load()`, right? So how can I use all those files for classification? – arqam Apr 22 '16 at 14:58
  • @Arqam: Normally you'd just load the most recent pickle file, the one with the biggest number. But if you make a mistake with the training, then you can tell the program to load an earlier version. – PM 2Ring Apr 22 '16 at 15:01
  • @PM2Ring Can you please come to chat? – arqam Apr 22 '16 at 15:02
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109929/discussion-between-pm-2ring-and-arqam). – PM 2Ring Apr 22 '16 at 15:03
  • The code is working, but about the updating part: the pickle file remains the same size even after I run the code 10 times. Is it possible that the new values are replacing the old ones here? – arqam Apr 22 '16 at 17:49
  • @Arqam Maybe, but that depends on how that `.train ` method works, it's got nothing to do with `pickle`. – PM 2Ring Apr 23 '16 at 01:34
  • @PM2Ring No, I got the result by just using `open(picklename,"ab")`, so I guess the pickle file does get updated. The size of the pickle file doubled this time when I used the above command. – arqam Apr 23 '16 at 18:03
  • @arqam That just appends the new classifier object as a separate object. Please see http://stackoverflow.com/questions/12761991/how-to-use-append-with-pickle-in-python – PM 2Ring Apr 24 '16 at 02:49
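
To illustrate the last point, here is a small sketch of what opening the file in `"ab"` mode does. The file name `appended.pickle`, the helper `load_all`, and the dictionaries are made up for illustration. Each `pickle.dump` adds a separate object to the end of the file, and each `pickle.load` reads back one object.

```python
import os
import pickle
import tempfile

# Hypothetical file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "appended.pickle")

# Each "run" appends one more pickled object to the same file.
for version in ({"trained_on": 1}, {"trained_on": 2}):
    with open(path, "ab") as f:
        pickle.dump(version, f)


def load_all(filename):
    """Yield every pickled object stored back to back in the file."""
    with open(filename, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:  # raised once there is nothing left to read
                return


# A single pickle.load only sees the first (oldest) object.
with open(path, "rb") as f:
    first = pickle.load(f)

objects = list(load_all(path))
print(first, len(objects), objects[-1])
```

The file grows with every run because the old objects are still in it. They are never merged, so this does not update the classifier.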