
I am trying to save a one-hot encoder from Keras so I can use it again on different texts while keeping the same encoding.

Here is my code:

df = pd.read_csv('dataset.csv')
vocab_size = 200000
encoded_docs = [one_hot(d, vocab_size) for d in df.text]

How can I save this encoder and use it again later?

I found this in my research, but one_hot() seems to be a function and not an object (sorry if this is plain wrong, I am fairly new to Python).

CuriousLearner
  • Won't pickling it work? I.e. `import pickle` then `with open("encoder", "wb") as f: pickle.dump(one_hot, f)`. Functions are objects, too. – L3viathan Oct 01 '19 at 13:28
  • Thanks for the answer, your code works and I am able to save it to a file, but how can I restore and reuse it? – CuriousLearner Oct 01 '19 at 13:40
  • 1
    `import pickle; with open("encoder", "rb") as f: one_hot = pickle.load(f)` – L3viathan Oct 01 '19 at 13:43
  • Thanks again but when I do `encoder = pickle.load(f)` and then after `encoded_docs =[encoder(d, vocab_size) for d in df.text]` the encoding seems different, as if I retrained the encoder with this line. – CuriousLearner Oct 01 '19 at 13:48
  • How exactly is `one_hot` created? – L3viathan Oct 01 '19 at 14:06
  • It's imported: `from keras.preprocessing.text import one_hot`, and its first and only usage in the code is the one I showed in the original question: `encoded_docs = [one_hot(d, vocab_size) for d in df.text]`. – CuriousLearner Oct 01 '19 at 14:08
  • 2
    Aha! This supposed encoding is cheating, it's using `hash()` to generate quasi-unique encodings. Due to hash seed randomization, the numbers will always be different. Start Python with `PYTHONHASHSEED=0 python`, then it should work (and you don't need to pickle the function, just import it). – L3viathan Oct 01 '19 at 14:19
  • Hum interesting answer ! What would you recommend using in my case to feed texts to an embedding layer ? – CuriousLearner Oct 01 '19 at 14:24
  • I usually use [sklearn's OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for this. I think this is out of scope for this question, though. The solution in my last comment should work. – L3viathan Oct 01 '19 at 14:36
  • That's true, thanks again ! – CuriousLearner Oct 01 '19 at 14:37

2 Answers


Posting the answer from the comments section here, for the benefit of the community.

To save the encoder, you can use the code below:

import pickle
with open("encoder", "wb") as f: 
    pickle.dump(one_hot, f)

Then, to load the saved encoder, use this code:

import pickle
with open("encoder", "rb") as f:
    encoder = pickle.load(f)
encoded_docs = [encoder(d, vocab_size) for d in df.text]

Since the function imported via `from keras.preprocessing.text import one_hot` uses Python's built-in `hash()` to generate quasi-unique encodings, we need to fix the hash seed to make the results reproducible (i.e., get the same encoding across multiple executions).

Set the hash seed when starting Python from the terminal:

PYTHONHASHSEED=0 python
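To see why the seed matters, here is a rough sketch of the hashing trick that `one_hot` relies on. This is an illustration, not the exact Keras implementation (`hashing_trick` is a stand-in name):

```python
# Approximation of what keras.preprocessing.text.one_hot does:
# bucket each word by Python's built-in hash().
def hashing_trick(text, n):
    # ids fall in [1, n-1]; 0 is conventionally reserved for padding
    return [hash(w) % (n - 1) + 1 for w in text.lower().split()]

encoded = [hashing_trick(d, 200000) for d in ["hello world", "hello again"]]
# within a single process the mapping is stable: "hello" gets the same id
assert encoded[0][0] == encoded[1][0]
```

Because `hash()` is randomly seeded per interpreter process, the ids change from run to run unless `PYTHONHASHSEED` is fixed, which is exactly why the encoding appeared to "retrain" after unpickling.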


The previous answer is great; another available option is joblib:

from joblib import dump, load
dump(clf, 'filename.joblib') # save the model
clf = load('filename.joblib') # load and reuse the model
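
A minimal round trip with joblib might look like this (using a plain dict as a stand-in for whatever picklable object you want to persist; the filename is arbitrary):

```python
from joblib import dump, load

# stand-in for a fitted model or encoder; any picklable object works
model = {"vocab_size": 200000}

dump(model, "model.joblib")      # save to disk
restored = load("model.joblib")  # load and reuse
assert restored == model
```

Note that persisting the `one_hot` function this way still does not make its output reproducible across runs; the `PYTHONHASHSEED` fix from the accepted answer is what matters for that.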
Memphis Meng