9

I am training a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory, and used the built-in Keras `Tokenizer` to do the necessary preprocessing, including mapping each word to a token. Then, I used model.fit().

Now, I want to extend to the full dataset, and don't have the space to read all the data into memory. So, I'd like to make a generator function to sequentially read data from disk, and use model.fit_generator(). However, if I do this, then I separately fit a Tokenizer object on each batch of data, producing a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary using Keras?
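For context, the disk-reading generator described above can be sketched as follows. This is a minimal, hypothetical example (the file layout, `file_batch_generator` name, and line-per-text format are assumptions, not part of the question); it only demonstrates reading texts lazily in fixed-size batches:

```python
import os
import tempfile

def file_batch_generator(file_paths, batch_size):
    """Yield lists of `batch_size` texts at a time, reading files lazily
    (one text per line) so the full dataset never sits in memory."""
    batch = []
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                batch.append(line.rstrip("\n"))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:  # flush any remainder smaller than batch_size
        yield batch

# Demo with two small temporary files standing in for the large dataset.
tmpdir = tempfile.mkdtemp()
for i, lines in enumerate([["the cat sat", "on the mat"], ["a dog barked"]]):
    with open(os.path.join(tmpdir, f"part{i}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

paths = sorted(os.path.join(tmpdir, p) for p in os.listdir(tmpdir))
batches = list(file_batch_generator(paths, batch_size=2))
print(batches)  # [['the cat sat', 'on the mat'], ['a dog barked']]
```

Fitting a fresh Tokenizer on each such batch is what produces the inconsistent mappings the question describes.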

Ben F
    1) show some code of what you are currently doing. 2) why not separate the preprocessing task, save the mapping object on your hard drive and then make the transform happen in the generation of batches? – Nassim Ben Mar 03 '17 at 05:52

1 Answer

4

So basically you could define a text generator and feed it to the fit_on_texts method in the following manner:

  1. Assuming that you have texts_generator, which reads your data from disk in parts and returns an iterable collection of texts, you may define:

    def text_generator(texts_generator):
        # Flatten batches of texts into a stream of single texts.
        for texts in texts_generator:
            for text in texts:
                yield text
    

    Please take care that this generator must stop after reading the whole of the data from disk, which might require changing the original generator you want to use in model.fit_generator (training generators are typically infinite).

  2. Once you have the generator from 1. you may simply apply the tokenizer.fit_on_texts method by:

    tokenizer.fit_on_texts(text_generator(texts_generator))
    
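Putting the pieces together, the pattern uses two generators over the same data: a finite one for fitting the vocabulary once, and a separate (usually infinite) one for training. The sketch below uses a minimal word-index builder as a stand-in for Keras' Tokenizer so it runs standalone; in a real script you would instead call `tokenizer.fit_on_texts(text_generator(texts_generator))` as above:

```python
import itertools

def text_generator(texts_generator):
    """Flatten batches of texts into a stream of single texts."""
    for texts in texts_generator:
        for text in texts:
            yield text

data = [["the cat sat", "on the mat"], ["a dog barked"]]

# Pass 1 (finite): build the word index once over the whole dataset.
# With Keras this would be: tokenizer.fit_on_texts(text_generator(iter(data)))
word_index = {}
for text in text_generator(iter(data)):
    for word in text.split():
        word_index.setdefault(word, len(word_index) + 1)

# Pass 2 (infinite): the training generator reuses the fixed mapping,
# so every batch is tokenized consistently for model.fit_generator.
def training_batches(chunks):
    for chunk in itertools.cycle(chunks):
        yield [[word_index[w] for w in text.split()] for text in chunk]

first = next(training_batches(data))
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, ...}
print(first)       # [[1, 2, 3], [4, 1, 5]]
```

The key design point is that the vocabulary-fitting pass must terminate, while the training pass can cycle forever.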
Marcin Możejko
  • Thanks, I didn't think of passing a generator to the 'fit_on_texts' method itself. I'll try this and let you know how it works. – Ben F Mar 06 '17 at 15:49
  • I checked that it should work. Beware that the generator should stop at some point. – Marcin Możejko Mar 06 '17 at 15:49
  • @MarcinMożejko Thanks for the solution! I want to clarify: when do we actually call the `text_generator`? And do we have two generators here? One for `fit_generator` and one for `fit_on_text`? – emremrah May 17 '19 at 12:57