
I've trained a sentiment classifier model using the Keras library by following these steps (broadly):

  1. Convert Text corpus into sequences using Tokenizer object/class
  2. Build and train a model using the model.fit() method
  3. Evaluate this model

Now, for scoring with this model, I was able to save the model to a file and load it back. However, I haven't found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?

Marcin Możejko

6 Answers


The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save a Tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
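The answer also mentions joblib; since a fitted Tokenizer is an ordinary picklable Python object, the same round-trip pattern applies either way. Here is a minimal sketch of that round trip, using a plain dict as a stand-in for the Tokenizer so it runs without Keras installed (the field names are illustrative):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted Tokenizer: any picklable Python object behaves the same.
tokenizer_stub = {'word_index': {'hello': 1, 'world': 2}, 'num_words': None}

path = os.path.join(tempfile.mkdtemp(), 'tokenizer.pickle')

# saving
with open(path, 'wb') as handle:
    pickle.dump(tokenizer_stub, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == tokenizer_stub)  # True: the object survives the round trip
```

The same two `with` blocks work unchanged when `tokenizer_stub` is a real Keras Tokenizer.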
today
Marcin Możejko

The Tokenizer class has a function to save its data in JSON format:

import io
import json

tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:

import json
from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
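Note that to_json() already returns a JSON string, so the json.dumps() call when saving wraps that string in a second layer of JSON encoding. The round trip still works because json.load() unwraps the outer layer before tokenizer_from_json() sees the string. A minimal sketch of why the extra layer is harmless (the string below is an illustrative stand-in for real to_json() output):

```python
import json

# Stand-in for what tokenizer.to_json() returns: a JSON string.
tokenizer_json = '{"class_name": "Tokenizer", "config": {"num_words": null}}'

# Saving as in the answer wraps the string in a second JSON layer...
wrapped = json.dumps(tokenizer_json, ensure_ascii=False)

# ...and json.load()/json.loads() on the file contents unwraps it again,
# handing tokenizer_from_json() the original JSON string.
unwrapped = json.loads(wrapped)
print(unwrapped == tokenizer_json)  # True
```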
Max
  • tokenizer_from_json doesn't seem to be available in Keras anymore, or rather it's not listed in their docs or available in the package in conda. @Max do you still do it this way? – benbyford May 03 '19 at 14:08
  • @benbyford I use the `Keras-Preprocessing==1.0.9` package from PyPI and the function [is available](https://github.com/keras-team/keras-preprocessing/blob/0494094a3ba341a67fdb9960e326fe6b9f582708/keras_preprocessing/text.py#L488) – Max May 13 '19 at 00:33
  • `tokenizer_to_json` should be available on tensorflow > 2.0.0 at some point soon; see this [pr](https://github.com/tensorflow/tensorflow/pull/31946). In the meantime, `from keras_preprocessing.text import tokenizer_from_json` can be used. – Manuel Oct 30 '19 at 15:56
  • This worked for me. Thank you – MarkK Dec 21 '21 at 20:50

The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts comprises two lists, Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).

Concrete Example:

from keras.preprocessing.text import Tokenizer

# Note: several adjacent string literals below are missing separating commas,
# so Python concatenates them; docs therefore has 9 elements, not 11, and the
# results shown below reflect that.
docs = ["A heart that",
        "full up like",
        "a landfill",
        "no surprises",
        "and no alarms"
        "a job that slowly"
        "Bruises that",
        "You look so",
        "tired happy",
        "no alarms",
        "and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT  TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train)  # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))

# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs)  # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))

Results:

result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]

Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty brackets [].
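One way to soften this failure mode is the Tokenizer's oov_token argument (e.g. Tokenizer(oov_token='<OOV>')), which maps unseen test tokens to a reserved index instead of silently dropping them. The effect can be sketched without Keras using a plain dict as the vocabulary; the token lists and index scheme here are illustrative, not Keras's exact internals:

```python
# Build a vocabulary from training tokens only; index 1 is reserved for OOV,
# mirroring what Tokenizer(oov_token=...) does.
train_tokens = ["no", "surprises", "and", "no", "alarms"]
vocab = {"<OOV>": 1}
for tok in train_tokens:
    vocab.setdefault(tok, len(vocab) + 1)

def encode(tokens, vocab):
    # Unseen tokens fall back to the OOV index instead of vanishing.
    return [vocab.get(tok, vocab["<OOV>"]) for tok in tokens]

print(encode(["no", "alarms"], vocab))    # [2, 5] -- all tokens known
print(encode(["tired", "happy"], vocab))  # [1, 1] -- entirely unseen, all OOV
```

With this fallback, disjoint test vocabulary yields OOV indices rather than the empty lists described above.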

Quetzalcoatl
  • moral of the story: if using word embeddings and Keras's Tokenizer, use fit_on_texts only once on a very large corpus; or use character n-grams instead. – Quetzalcoatl Jul 06 '18 at 07:11
  • I don't understand what message you're trying to communicate: why would one fit on test docs in the first place? By definition, whatever it is that you're doing, the test set must be kept in a vault as if you didn't know you had it in the first place. – gented Oct 09 '19 at 07:47
  • @gented: you may be confusing unsupervised text parsing with supervised ML. Correct me if I'm wrong, but keras's Tokenizer does not have a loss function attached to it that is meant for generalization; hence, is not a (supervised) machine learning problem -- which appears to be your assumption. The message I was trying to communicate is summarized in my first comment above ("moral of the story..."), which may be worth re-reading. – Quetzalcoatl Oct 23 '19 at 00:30
  • @gented good points. sorry if the nomenclature confused you; I was keeping some consistency with the comments in the accepted answer. – Quetzalcoatl Oct 23 '19 at 20:42
  • I agree with @gented in that you do not want to fit your tokenizer in the test set because then you remove the possibility of oov tokens at test time, defeating the purpose of a test set. It's not about the tokenizer having a loss, but rather about the data from the test set leaking into your training data. – rodrigo-silveira Jun 22 '21 at 21:12

I've created the issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue has a link to a gist with code demonstrating how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (for various reasons, but mainly a mixed JS/Python environment), and this allows for that, even with sort_keys=True.
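The sort_keys=True mentioned here makes the serialized JSON deterministic, which is what you want when model files are diffed, hashed, or shared across a mixed JS/Python setup. A small sketch of the idea, using an illustrative tokenizer-like config dict (the field names are made up for the example):

```python
import json

# Two dicts with the same contents but different insertion order.
config_a = {'word_index': {'b': 2, 'a': 1}, 'num_words': None, 'lower': True}
config_b = {'lower': True, 'num_words': None, 'word_index': {'a': 1, 'b': 2}}

# With sort_keys=True, keys are sorted at every nesting level, so the
# serialized output is byte-for-byte identical regardless of ordering.
a = json.dumps(config_a, sort_keys=True)
b = json.dumps(config_b, sort_keys=True)
print(a == b)  # True
```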

user9170
  • the linked gist looks like a good way to "reload" a trained tokenizer. However, the original question potentially relates to "extending" a previously saved tokenizer to new (test) texts; this part still seems open (otherwise, why "save" a model if it won't be used to "score" new data?) – Quetzalcoatl Jul 06 '18 at 06:32
  • I think their intent is clear: "Without this I'll have to process the corpus every time I need to score even a single sentence". From this, I gather that they want to skip the tokenizing step and evaluate the trained model on other data. They don't ask anything else; that is what you are anticipating. Like most people, they only want to use a previously fitted tokenizer on a different data set, which is skipped in most tutorials. Therefore, I think my answer 1) answers what was asked, and 2) provides working code. – user9170 Jul 09 '18 at 13:39
  • fair points. the question is "Saving Tokenizer object to file for scoring" so one might assume they're asking about scoring (potentially new data), too. – Quetzalcoatl Jul 09 '18 at 18:58

I found the following snippet, provided by @thusv89 at the following link.

Save objects:

import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor, 
         'target_tensor': target_tensor, 
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)

Load objects:

with open('data_objects.pickle', 'rb') as f:
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']
Peter O.
Arun

Quite easy, because the Tokenizer class provides two functions for saving and loading:

save: Tokenizer.to_json()

load: keras.preprocessing.text.tokenizer_from_json

The to_json() method calls the get_config() method, which handles this:

    json_word_counts = json.dumps(self.word_counts)
    json_word_docs = json.dumps(self.word_docs)
    json_index_docs = json.dumps(self.index_docs)
    json_word_index = json.dumps(self.word_index)
    json_index_word = json.dumps(self.index_word)

    return {
        'num_words': self.num_words,
        'filters': self.filters,
        'lower': self.lower,
        'split': self.split,
        'char_level': self.char_level,
        'oov_token': self.oov_token,
        'document_count': self.document_count,
        'word_counts': json_word_counts,
        'word_docs': json_word_docs,
        'index_docs': json_index_docs,
        'index_word': json_index_word,
        'word_index': json_word_index
    }
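Notice that get_config() stores the dict-valued fields (word_counts, word_index, and so on) as pre-dumped JSON strings inside the config. Restoring a Tokenizer from this config therefore requires a second json.loads on those fields, which is what tokenizer_from_json does internally. A minimal sketch of that inner round trip, using made-up counts:

```python
import json

# get_config() stores dict-valued fields as JSON strings, e.g.:
word_counts = {'no': 4, 'surprises': 2}
config = {'num_words': None, 'word_counts': json.dumps(word_counts)}

# Restoring therefore needs a second decode for those string fields:
restored_counts = json.loads(config['word_counts'])
print(restored_counts == word_counts)  # True
```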