
I have a dataframe with text columns, which I split into x_train and x_test.

My question is: is it better to do Keras's Tokenizer.fit_on_texts() on the entire x data set, or just on x_train?

Like this:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_data)

or

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)
tokenizer.texts_to_sequences(x_train)

Does it matter? I'd also have to tokenize x_test later too, so can I just use the same tokenizer?


1 Answer


The information in this question is good, but there is something more important that you need to notice:

You MUST use the same tokenizer in training and test data

Otherwise, there will be different tokens for each dataset. Each tokenizer has an internal dictionary that is created with fit_on_texts.

It's not guaranteed that train and test data will have the same words with same frequencies, so each dataset will create a different dictionary, and all results from test data will be wrong.
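To make this concrete, here is a small sketch (the texts are made up) showing that two tokenizers fitted on different splits assign different indices to the same word, so sequences produced by one are meaningless to a model trained with the other:

```python
from keras.preprocessing.text import Tokenizer

train = ['the cat sat', 'the dog ran']
test = ['the dog sat']

tok_train = Tokenizer()
tok_train.fit_on_texts(train)

tok_test = Tokenizer()
tok_test.fit_on_texts(test)

# Same word, different index in each internal dictionary:
print(tok_train.word_index)
print(tok_test.word_index)
print(tok_train.word_index['dog'], tok_test.word_index['dog'])
```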

This also means that you cannot fit_on_texts, train and then fit_on_texts again: this will change the internal dictionary.
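A quick sketch of why refitting breaks things: the second fit_on_texts call accumulates counts on top of the old ones, and the index assignments shift, so any sequences produced with the old mapping no longer match:

```python
from keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(['apple banana', 'apple cherry'])
before = dict(tok.word_index)  # 'apple' is most frequent here, so it gets index 1

tok.fit_on_texts(['banana banana banana'])  # counts accumulate across calls
after = dict(tok.word_index)   # now 'banana' is the most frequent word

print(before)
print(after)
```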

It's possible to fit on the entire data. But it's probably a better idea to reserve a token for "unknown" words (oov_token=True, or a string such as oov_token="<OOV>" in newer Keras versions), for the cases where you find new test data with words your model has never seen (this requires that you replace rare words in the training data with this token too).

As @Fernando H mentioned, it is probably better to fit the tokenizer only on the train data (though you must reserve an oov token even in training data, so the model learns what to do with the oov).


Testing the tokenizer with unknown words:

The following test shows that the tokenizer completely ignores unknown words when oov_token is not set. This might not be a good idea. Unknown words may be key words in sentences and simply ignoring them might be worse than knowing there is something unknown there.

from keras.preprocessing.text import Tokenizer

training = ['hey you there', 'how are you', 'i am fine thanks', 'hello there']
test = ['he is fine', 'i am fine too']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(training)

print(tokenizer.texts_to_sequences(training))
print(tokenizer.texts_to_sequences(test))

Outputs:

[[3, 1, 2], [4, 5, 1], [6, 7, 8, 9], [10, 2]]
[[8], [6, 7, 8]]

Now, this shows that the tokenizer will attribute index 1 to all unknown words:

tokenizer2 = Tokenizer(oov_token=True)
tokenizer2.fit_on_texts(training)
print(tokenizer2.texts_to_sequences(training))
print(tokenizer2.texts_to_sequences(test))

Outputs:

[[4, 2, 3], [5, 6, 2], [7, 8, 9, 10], [11, 3]]
[[1, 1, 9], [7, 8, 9, 1]]

But it might be interesting to replace a group of rare words in the training data with 1 (the oov index) too, so your model has a notion of how to deal with unknown words.
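A hypothetical sketch of that idea (the helper replace_rare_words and the min_count threshold are made up for illustration): replace words below a frequency threshold with the oov string before fitting, so rare training words and never-seen test words end up sharing the same index:

```python
from collections import Counter
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

def replace_rare_words(texts, min_count=2, oov='oov'):
    # Count words split the same way the Tokenizer splits them
    counts = Counter(w for t in texts for w in text_to_word_sequence(t))
    return [' '.join(w if counts[w] >= min_count else oov
                     for w in text_to_word_sequence(t))
            for t in texts]

training = ['hey you there', 'how are you', 'i am fine thanks', 'hello there']
cleaned = replace_rare_words(training)  # e.g. 'hey you there' -> 'oov you there'

# A plain lowercase oov string avoids the Tokenizer's default `filters`,
# which would strip the '<' and '>' from a token like '<OOV>'.
tokenizer = Tokenizer(oov_token='oov')
tokenizer.fit_on_texts(cleaned)

# Unknown test words now map to the same index the rare words were trained with:
print(tokenizer.texts_to_sequences(['he is fine']))
```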

    Thank you very much for this answer! I was having nightmares with this issue, and nobody I talked to seemed to see a problem. I am really glad that someone with a high score on SO has answered it. Doing fit_on_texts on the whole dataset seemed like a bit of data leakage to me. [Even the official Keras blog did the fit on the whole dataset, which made things even more confusing.](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) Just one question, do you know any research papers addressing this problem of creating a training vocabulary using the whole dataset? – xicocaio May 16 '19 at 21:25
  • Sorry, although I'm good with Keras and understand things in general, I haven't read many papers. – Daniel Möller May 17 '19 at 00:52
  • 4
    @xicocaio The main idea of dividing your dataset into train and test is to evaluate your model on future unknown situations in an objective way. That said, if you fit your tokenizer on the whole dataset you are somehow biasing your model. For a good evaluation of your model, you have to take the UNK tokens into account. So, as with any other kind of "feature extraction", the best practice is to fit only on train and apply to all. – OSainz Jul 08 '19 at 05:49
  • @DanielMöller @OSainz Do I have to do the same if I have train, validation, and test? Basically I split my data into 80% training and 20% testing, and I further split my training data into 90% training and 10% validation. Do I have to fit my tokenizer on the 90% training data and do ```texts_to_sequences``` on the 90% training and 10% validation? Then retrain the model on the original 80% training, fit on it, and redo ```texts_to_sequences``` on the 80% training and 20% testing? – Perl Del Rey Nov 21 '19 at 09:26
  • 1
    @Hiyam One tokenization only, you must keep the tokenizer for future data, the model must see the same tokens consistently. – Daniel Möller Nov 21 '19 at 09:49
  • @DanielMöller Thank you for your reply. So if I split my 80% training data into 90% training and 10% validation, I do the tokenization only on the 90% training and use it everywhere in my model (on the 10% validation and when I **re-train**)? – Perl Del Rey Nov 21 '19 at 09:56
  • Tokenize the entire data once. – Daniel Möller Nov 21 '19 at 09:58
  • @DanielMöller sorry for the inconvenience, but we were saying we must not tokenize the whole data before the train/test split, right? ***(I mean the ```fit_on_texts```)*** – Perl Del Rey Nov 21 '19 at 10:01
  • 1
    `fit_on_texts` must be used only once, a single time, on the entire data (so before the split). Notice the comment in the answer about rare words (it will be important). You cannot call `fit_on_texts` more than once. Later you can split the data the way you want (you can split it already tokenized or not, no problem when you call `texts_to_sequences`) – Daniel Möller Nov 21 '19 at 10:08
  • 1
    Don't you guys think fitting on the entire dataset will result in 'better than real' results on the validation and test sets? I mean, there is a probability of your model finding unknown words in the prediction phase, so eliminating this chance in the evaluation stage will create results that do not reflect the real prediction performance. I think tokenizing just the training data using oov_token is a better way to describe future performance in the eval and test phases. Am I wrong? – Fernando H'.' Mar 28 '20 at 12:08
  • You're not wrong. If you think of it, it doesn't matter much whether test values are tokenized in train. If they're not in the train set, they will never be trained anyway, so the results might even be worse than using an OOV. – Daniel Möller Mar 28 '20 at 21:18
  • @DanielMöller Could you not do `tokenizerCol1 = Tokenizer()` for `Col1`, call `tokenizerCol1.texts_to_sequences()` on your test set for `Col1`, and then follow the same process for `Col2`, where `tokenizerCol2 = Tokenizer()` is for `Col2`? Would this not allow you to use two different ones? – StackPancakes Aug 25 '20 at 14:51
  • @StackPancakes, if they're two columns of different nature, yes. If they are similar, maybe it's better to make a single tokenizer. – Daniel Möller Aug 26 '20 at 12:15