
I am trying to save a one-hot encoder from Keras so I can use it again on different texts while keeping the same encoding.

Here is my code:

df = pd.read_csv('dataset.csv')
vocab_size = 200000
encoded_docs = [one_hot(d, vocab_size) for d in df.text]

How can I save this encoder and use it again later?

I found this in my research, but one_hot() seems to be a function and not an object (sorry if this is plain wrong, I am fairly new to Python).

CuriousLearner
  • Won't pickling it work? I.e. `import pickle` then `with open("encoder", "wb") as f: pickle.dump(one_hot, f)`. Functions are objects, too. – L3viathan Oct 01 '19 at 13:28
  • Thanks for the answer, your code works and I am able to save it to a file, but how can I restore and reuse it? – CuriousLearner Oct 01 '19 at 13:40
  • 1
    `import pickle; with open("encoder", "rb") as f: one_hot = pickle.load(f)` – L3viathan Oct 01 '19 at 13:43
  • Thanks again but when I do `encoder = pickle.load(f)` and then after `encoded_docs =[encoder(d, vocab_size) for d in df.text]` the encoding seems different, as if I retrained the encoder with this line. – CuriousLearner Oct 01 '19 at 13:48
  • How exactly is `one_hot` created? – L3viathan Oct 01 '19 at 14:06
  • It's imported: `from keras.preprocessing.text import one_hot`, and its first and only usage in the code is the one I showed in the original question: `encoded_docs = [one_hot(d, vocab_size) for d in df.text]`. – CuriousLearner Oct 01 '19 at 14:08
  • 2
    Aha! This supposed encoding is cheating, it's using `hash()` to generate quasi-unique encodings. Due to hash seed randomization, the numbers will always be different. Start Python with `PYTHONHASHSEED=0 python`, then it should work (and you don't need to pickle the function, just import it). – L3viathan Oct 01 '19 at 14:19
  • Hum interesting answer ! What would you recommend using in my case to feed texts to an embedding layer ? – CuriousLearner Oct 01 '19 at 14:24
  • I usually use [sklearn's OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for this. I think this is out of scope for this question, though. The solution in my last comment should work. – L3viathan Oct 01 '19 at 14:36
  • That's true, thanks again ! – CuriousLearner Oct 01 '19 at 14:37

2 Answers


Posting the answer from the comments section here, for the benefit of the community.

To save the encoder, you can use the code below:

import pickle
with open("encoder", "wb") as f: 
    pickle.dump(one_hot, f)

Then, to load the saved encoder, use this code:

import pickle
with open("encoder", "rb") as f:
    encoder = pickle.load(f)
encoded_docs = [encoder(d, vocab_size) for d in df.text]

Since the function imported via `from keras.preprocessing.text import one_hot` uses Python's built-in `hash()` to generate quasi-unique encodings, we need to fix the hash seed to make the results reproducible (i.e., get the same encoding across multiple executions).

Set the hash seed when starting Python from the terminal:

PYTHONHASHSEED=0 python
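To see why the seed matters, here is a rough sketch of the hashing trick that `one_hot` relies on. This is an illustration, not the exact Keras implementation (`hashing_trick` is a stand-in name):

```python
# Approximation of what keras.preprocessing.text.one_hot does:
# bucket each word by Python's built-in hash().
def hashing_trick(text, n):
    # ids fall in [1, n-1]; 0 is conventionally reserved for padding
    return [hash(w) % (n - 1) + 1 for w in text.lower().split()]

encoded = [hashing_trick(d, 200000) for d in ["hello world", "hello again"]]
# within a single process the mapping is stable: "hello" gets the same id
assert encoded[0][0] == encoded[1][0]
```

Because `hash()` is randomly seeded per interpreter process, the ids change from run to run unless `PYTHONHASHSEED` is fixed, which is exactly why the encoding appeared to "retrain" after unpickling.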


The previous answer is great; another available option is joblib:

from joblib import dump, load
dump(clf, 'filename.joblib') # save the model
clf = load('filename.joblib') # load and reuse the model
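
A minimal round trip with joblib might look like this (using a plain dict as a stand-in for whatever picklable object you want to persist; the filename is arbitrary):

```python
from joblib import dump, load

# stand-in for a fitted model or encoder; any picklable object works
model = {"vocab_size": 200000}

dump(model, "model.joblib")      # save to disk
restored = load("model.joblib")  # load and reuse
assert restored == model
```

Note that persisting the `one_hot` function this way still does not make its output reproducible across runs; the `PYTHONHASHSEED` fix from the accepted answer is what matters for that.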
Memphis Meng