3

I have a very large data set and am using Keras' fit_generator to train a Keras model (TensorFlow backend). My data needs to be normalized across the entire data set; however, when using fit_generator I only have access to relatively small batches, and normalizing the data within such a small batch is not representative of normalizing across the entire data set. The impact is quite large (I tested it and the model accuracy is significantly degraded).

My question is this: what is the correct practice for normalizing data across the entire data set when using Keras' fit_generator? One last point: my data is a mix of text and numeric data, not images, and hence I am not able to use some of the capabilities of Keras' provided image generator, which may address some of these issues for image data.

I have looked at normalizing the full data set prior to training (the "brute-force" approach, I suppose), but I am wondering if there is a more elegant way of doing this.

Eric Broda
  • Consider looking [here](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) – Tristhal Jun 04 '18 at 14:26

2 Answers

3

The generator does allow you to do on-the-fly processing of data, but pre-processing the data prior to training is the preferred approach:

  1. Pre-processing and saving avoids re-processing the data for every epoch; in the generator you should really only do small operations that can be applied per batch. One-hot encoding, for example, is a common one, while tokenising sentences etc. can be done offline.
  2. You will probably tweak and fine-tune your model. You don't want the overhead of normalising the data every time, and you want to ensure every model trains on the same normalised data.

So, pre-process once offline prior to training and save it as your training data. When predicting you can process on-the-fly.
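If the numeric part of the data fits in memory as a NumPy array, a minimal sketch of this pre-process-once-then-save approach could look like the following. It uses scikit-learn's StandardScaler (as suggested in the comment under the question); X_num, x_new and the file names are placeholders:

import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

# offline, before training: fit on the entire numeric data set and save the result
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)   # X_num: all numeric columns as a 2-D array
np.save('X_num_scaled.npy', X_num_scaled)    # this becomes the training data fed to the generator
joblib.dump(scaler, 'scaler.pkl')            # keep the data-set-wide statistics

# at prediction time: process on-the-fly with the saved statistics
scaler = joblib.load('scaler.pkl')
x_new_scaled = scaler.transform(x_new)       # x_new: 2-D array of new samples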

nuric
0

You would do this by pre-processing your data into a matrix. One-hot encode your text data:

from keras.preprocessing.text import Tokenizer
# X is a list of text elements
t = Tokenizer()
t.fit_on_texts(X)
X_one_hot = t.texts_to_matrix(X)

and normalize your numeric data via:

import numpy as np
# min-max normalize each row of the numeric matrix; the small constant avoids division by zero
normalized_matrix = np.array([(row - np.min(row)) / (np.max(row) - np.min(row) + 0.0001) for row in matrix])
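To encode new text with the same vocabulary at prediction time, the fitted Tokenizer can be saved and reloaded. A minimal sketch using pickle (new_texts is a hypothetical list of unseen strings):

import pickle

# save the fitted tokenizer alongside the pre-processed training data
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(t, f)

# at prediction time, reload it and encode new text with the same word index
with open('tokenizer.pkl', 'rb') as f:
    t = pickle.load(f)
new_one_hot = t.texts_to_matrix(new_texts)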

If you concatenate the two matrices you should have properly preprocessed your data. I could imagine, though, that the text will always influence the outcome of your model too much, so it might make sense to train separate models for the text and the numeric data.
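A minimal sketch of that concatenation step, assuming X_one_hot and normalized_matrix (from the snippets above) have the same number of rows:

import numpy as np

# stack the one-hot text features and the normalised numeric features column-wise
X_train = np.hstack([X_one_hot, normalized_matrix])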

r3dapple