
I have two strings. First I tokenize the first string and dump the tokenizer into a pickle file, file.pickle. Then I tokenize the second string and dump that tokenizer into the same pickle file, file.pickle. I am using the code below:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pickle
text1 = ['I was able to save the model to a file and load from a file.']

tokenizer_t1 = Tokenizer(num_words=10, lower=True)
tokenizer_t1.fit_on_texts(text1)

with open('file.pickle', 'wb') as handle:
    pickle.dump(tokenizer_t1, handle)

with open('file.pickle', 'rb') as handle:
    tokenizer_txt1 = pickle.load(handle)
    print("word_index : ",tokenizer_txt1.word_index)

# word_index :  {'to': 1, 'a': 2, 'file': 3, 'i': 4, 'was': 5, 'able': 6, 'save': 7, 'the': 8, 'model': 9, 'and': 10, 'load': 11, 'from': 12}
text2 = ['Tokenizer class has a function to save data into JSON format. The accepted answer clearly demonstrates the tokenizer.']

tokenizer_t2 = Tokenizer(num_words=10, lower=True)
tokenizer_t2.fit_on_texts(text2)

with open('file.pickle', 'ab') as handle:
    pickle.dump(tokenizer_t2, handle)

with open('file.pickle', 'rb') as handle:
    tokenizer_txt2 = pickle.load(handle)
    print("word_index : ",tokenizer_txt2.word_index)

# word_index :  {'to': 1, 'a': 2, 'file': 3, 'i': 4, 'was': 5, 'able': 6, 'save': 7, 'the': 8, 'model': 9, 'and': 10, 'load': 11, 'from': 12}

When I read file.pickle, I get the following output:

word_index : {'to': 1, 'a': 2, 'file': 3, 'i': 4, 'was': 5, 'able': 6, 'save': 7, 'the': 8, 'model': 9, 'and': 10, 'load': 11, 'from': 12}

But my desired output should be:

{'to': 1, 'a': 2, 'file': 3, 'i': 4, 'was': 5, 'able': 6, 'save': 7, 'the': 8, 'model': 9, 'and': 10, 'load': 11, 'from': 12, 'tokenizer': 13, 'class': 14, 'has': 15, 'function': 16, 'data': 17, 'into': 18, 'json': 19, 'format': 20, 'accepted': 21, 'answer': 22, 'clearly': 23, 'demonstrates': 24}

It should contain only the unique tokens of both strings. How can I do this in Python?

Ekanshu

1 Answer


When you pickle data to a file once and then append more data to the same file, you need to unpickle the file two or more times (depending on how many times you appended data). See the accepted answer here.

with open('file.pickle', 'rb') as handle:
    tokenizer_txt1 = pickle.load(handle)
    tokenizer_txt2 = pickle.load(handle)
    print("word_index : ",tokenizer_txt1.word_index)
    print("word_index : ",tokenizer_txt2.word_index)
Mean Coder
  • When the question is a straight duplicate, flag it as such, don't post an answer that says "This is a duplicate, see this other answer". – ShadowRanger Nov 01 '21 at 10:45
  • @Mean Coder By using the above, I will get two dictionaries. But my question was about combining both into a single pickle file with unique tokens, so that I can load that pickle file and use it with the **texts_to_sequences** function. – Ekanshu Nov 01 '21 at 11:31
  • If you want to pickle just the word_index dictionaries of your tokenizers, one way is to combine both dictionaries into a single JSON-style object and dump it to a pickle file in one go (see the first sketch after these comments). – Mean Coder Nov 01 '21 at 11:46
  • @Mean Coder My use case is like this: 1. I am training a model for text classification. For this, I tokenize the text of a column using the Keras Tokenizer, then dump the tokenizer into a pickle file. Once the model is trained, I will use this tokenizer pickle file to tokenize new data before prediction. 2. After some time I want to retrain the model on a new dataset. For this I again tokenize the text of a column using the Keras Tokenizer. Now I want to dump this new tokenizer into the previously saved pickle file so that.... (continued in next comment) – Ekanshu Nov 01 '21 at 12:37
  • I have a combined pickle file covering the old dataset as well as the new dataset. As of now, I don't know how many times the model will be retrained in the future, so I just want to load the existing pickle file and append the new tokenizer whenever the model is retrained. – Ekanshu Nov 01 '21 at 12:37
  • An easy-to-implement solution is to read all the data from the pickle file into a variable every time you need to update the file, add the new data to that variable, and dump the variable back to the file in write mode; this overwrites the old data with the updated data (see the second sketch after these comments). – Mean Coder Nov 02 '21 at 04:58
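Following up on the suggestion above to pickle just the dictionaries: a minimal sketch of that idea, bundling both word_index dicts into a single object so one pickle.dump/pickle.load round trip covers both. The file name word_indices.pickle and the keys 't1'/'t2' are arbitrary choices for illustration, and tokenizer_t1/tokenizer_t2 are the fitted tokenizers from the question.

import pickle

# Bundle both vocabularies into one plain dict and dump it in a single call.
combined = {'t1': tokenizer_t1.word_index, 't2': tokenizer_t2.word_index}
with open('word_indices.pickle', 'wb') as handle:
    pickle.dump(combined, handle)

# One load now returns both dictionaries.
with open('word_indices.pickle', 'rb') as handle:
    loaded = pickle.load(handle)
print(loaded['t1'], loaded['t2'])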
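And a minimal sketch of the read-update-rewrite approach from the last comment, which also produces the single combined vocabulary the question asks for. It relies on the fact that Tokenizer.fit_on_texts accumulates across calls, so refitting the loaded tokenizer on the new texts extends the existing word_index rather than replacing it (the exact index numbers may differ from the desired output above, because Keras reassigns indices by word frequency).

import pickle

# Load the tokenizer that was fitted on the old dataset.
with open('file.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

# fit_on_texts is cumulative: new words are added to the vocabulary
# already learned from the old dataset.
new_texts = ['Tokenizer class has a function to save data into JSON format. The accepted answer clearly demonstrates the tokenizer.']
tokenizer.fit_on_texts(new_texts)

# Overwrite in write mode ('wb', not 'ab') so the file always holds
# exactly one tokenizer covering every dataset seen so far.
with open('file.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle)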