10

I have a huge dataset that I need to provide to Keras in the form of a generator because it does not fit into memory. However, using fit_generator, I cannot replicate the results I get during usual training with model.fit. Also each epoch lasts considerably longer.

I implemented a minimal example. Maybe someone can show me where the problem is.

import random
import numpy

from keras.layers import Dense
from keras.models import Sequential

random.seed(23465298)
numpy.random.seed(23465298)

no_features = 5
no_examples = 1000


def get_model():
    network = Sequential()
    network.add(Dense(8, input_dim=no_features, activation='relu'))
    network.add(Dense(1, activation='sigmoid'))
    network.compile(loss='binary_crossentropy', optimizer='adam')
    return network


def get_data():
    example_input = [[float(f_i == e_i % no_features) for f_i in range(no_features)] for e_i in range(no_examples)]
    example_target = [[float(t_i % 2)] for t_i in range(no_examples)]
    return example_input, example_target


def data_gen(all_inputs, all_targets, batch_size=10):
    input_batch = numpy.zeros((batch_size, no_features))
    target_batch = numpy.zeros((batch_size, 1))
    while True:
        for example_index, each_example in enumerate(zip(all_inputs, all_targets)):
            each_input, each_target = each_example
            wrapped = example_index % batch_size
            input_batch[wrapped] = each_input
            target_batch[wrapped] = each_target
            if wrapped == batch_size - 1:
                yield input_batch, target_batch


if __name__ == "__main__":
    input_data, target_data = get_data()
    g = data_gen(input_data, target_data, batch_size=10)
    model = get_model()
    model.fit(input_data, target_data, epochs=15, batch_size=10)  # 15 * (1000 / 10) * 10
    # model.fit_generator(g, no_examples // 10, epochs=15)        # 15 * (1000 / 10) * 10

On my computer, model.fit always finishes the 10th epoch with a loss of 0.6939, after about 2-3 seconds.

The method model.fit_generator, however, runs considerably longer and finishes the last epoch with a different loss (0.6931).

In general, I don't understand why the results of the two approaches differ. It might not look like much of a difference, but I need to be sure that the same data with the same net produce the same result, independent of whether I use conventional training or the generator.

Update: @Alex R. provided an answer for part of the original problem (some of the performance issue as well as changing results with each run). As the core problem remains, however, I merely adjusted the question and title accordingly.

wehnsdaefflae
  • I think you might be better off on a site oriented to Python programming. –  Aug 29 '17 at 17:14
  • How big is your training dataset? What happens if you increase the batch size in the fit generator? – Alex R. Aug 29 '17 at 17:23
  • @AlexR. I have ca. 2.5 million examples. If I increase the batch size, the loss is still unstable and still different from the loss I get with `model.fit()`. – wehnsdaefflae Aug 29 '17 at 17:26
  • 1
    @mdewey if you know a way to use Keras without Python, I'd look forward to hearing about it. –  Jun 30 '19 at 10:48
  • `Also each epoch lasts considerably longer.` The reason for that is obviously the overhead of I/O operations. It comes with the territory. To shorten it you may need a solid-state drive. –  Jun 30 '19 at 11:03

6 Answers

4

I don't understand how the loss can be unstable with a larger batch size, as there should be fewer fluctuations with larger batches. However, looking at the Keras documentation, the fit() routine looks like:

fit(self, x, y, batch_size=32, epochs=10, verbose=1, callbacks=None, validation_split=0.0, 
    validation_data=None, shuffle=True, class_weight=None, sample_weight=None, 
    initial_epoch=0)

which has a default batch_size=32 and epochs=10, whereas fit_generator() looks like:

fit_generator(self, generator, steps_per_epoch, epochs=1, verbose=1,
              callbacks=None, validation_data=None, validation_steps=None, 
              class_weight=None, max_queue_size=10, workers=1,
              use_multiprocessing=False, initial_epoch=0)

Specifically, "steps_per_epoch" is defined as:

steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.

So for starters, it sounds like your fit_generator is processing a massively larger number of samples per epoch than your fit() routine. See here for more details.
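
As a rough sketch (reusing the names from the question), the two calls lined up so that each epoch covers the same 1000 examples in batches of 10 would look something like:

batch_size = 10
steps = no_examples // batch_size  # 1000 / 10 = 100 weight updates per epoch

model.fit(input_data, target_data, epochs=15, batch_size=batch_size)
# versus
model.fit_generator(data_gen(input_data, target_data, batch_size=batch_size),
                    steps_per_epoch=steps, epochs=15)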

Alex R.
  • Thanks for your answer! It resolved part of the problem; you were right. I provided too many samples because I understood `steps_per_epoch` incorrectly. If I divide the parameter by the batch size (as suggested by the documentation), the result reproducibly converges to `0.6931`, but it is still different from the `fit` method and still about 10 times slower... – wehnsdaefflae Aug 29 '17 at 18:35
  • @wehnsdaefflae: The best I could find is this, and truthfully it makes no sense why the generator is slower when running on comparable inputs to the fit() routine: https://github.com/fchollet/keras/issues/2730 – Alex R. Aug 29 '17 at 18:50
  • See also this, which suggests lowering validation_steps: https://github.com/fchollet/keras/issues/6406#issuecomment-308248241 – Alex R. Aug 29 '17 at 18:51
  • Thanks for your research! It's good to see that at least the speed problem does not appear to be due to my code (any more). I'll leave the question open for a few more days as the other aspects are still open... – wehnsdaefflae Aug 29 '17 at 18:55
  • On top of this, you could also increase `max_queue_size` in `fit_generator` to keep producing batches while training – DJK Aug 29 '17 at 22:08
2

Batch sizes

  • In fit, you're using the standard batch size = 32.
  • In fit_generator, you're using a batch size = 10.

Keras runs a weight update after each batch, so if you're using batches of different sizes, the two methods compute different gradients. And once there is a single different weight update, the two models will never meet again.

Try to use fit with batch_size=10, or use a generator with batch_size=32.
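
To see why the batch size alone already matters, here is a toy sketch (mine, not from the question's model) of how the averaged gradient differs between a batch of 10 and a batch of 32:

import numpy

numpy.random.seed(0)
per_sample_gradients = numpy.random.randn(32)  # stand-in for per-example gradients of one weight
print(per_sample_gradients[:10].mean())        # update direction with batch_size=10
print(per_sample_gradients.mean())             # update direction with batch_size=32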


Seed problem?

Are you creating a new model with get_model() for each case?

If so, the initial weights in both models are different, and naturally you will get different results from them. (OK, you've set a seed, but if you're using the TensorFlow backend, you may be facing this issue.)

In the long run they will more or less converge, though. The difference between the two doesn't seem that large.
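
If that turns out to be the cause, here is a minimal sketch of pinning the backend seed as well (assuming the TensorFlow backend; the exact call depends on the TF version):

import random
import numpy
import tensorflow

random.seed(23465298)
numpy.random.seed(23465298)
tensorflow.set_random_seed(23465298)  # TF 1.x; on TF 2.x use tensorflow.random.set_seed(...)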


Checking data

If you are not sure that your generator yields the data you expect, loop over it and print/compare/check what it yields:

for i in range(numberOfBatches):
    x, y = next(g)  # or g.next() on Python 2
    # print or compare x, y here

Daniel Möller
  • Thanks for your answer. I guess the TensorFlow issue is not the case because `model.fit` returns the same loss in each run. And I compared both outputs: they are identical :( – wehnsdaefflae Aug 29 '17 at 21:57
  • Ok, have you tried identical batch sizes? See update in my answer. – Daniel Möller Aug 29 '17 at 22:03
  • 2
    In the code above, you can see that both batch sizes are set to 10 – wehnsdaefflae Aug 29 '17 at 22:04
  • Ok, two more things I can imagine (but I haven't checked, so forgive me if I'm wrong): 1 - The conversion from lists to numpy arrays may be changing the data type between float32 and float64; maybe try returning numpy arrays from `get_data()` as well. --- 2 - Is the size of the batch in the generator really 10 at the end of its creation? – Daniel Möller Aug 29 '17 at 22:14
1

Make sure to shuffle your batches within your generator.

This discussion suggests you turn on shuffling in your iterator: https://github.com/keras-team/keras/issues/2389. I had the same problem, and this resolved it.
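
As a rough sketch (my own, not from the linked issue) of what shuffling inside a generator like the one in the question could look like:

import numpy

def shuffling_data_gen(all_inputs, all_targets, batch_size=10):
    all_inputs = numpy.asarray(all_inputs)
    all_targets = numpy.asarray(all_targets)
    indices = numpy.arange(len(all_inputs))
    while True:
        numpy.random.shuffle(indices)  # new example order on every pass
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batch = indices[start:start + batch_size]
            yield all_inputs[batch], all_targets[batch]  # fancy indexing returns copies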

Cerno
0

As for the loss, that is possibly due to the batch size difference that has already been discussed.

As for the difference in training time, model.fit_generator() allows you to specify the number of "workers". This parameter controls how many threads (or processes, with use_multiprocessing=True) prepare batches from your generator in parallel while the model trains. If batch generation is the bottleneck, raising workers to 4 or 8 can give a large reduction in training time.
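
A minimal sketch of what that could look like with the generator from the question (parameter names as in the Keras 2 fit_generator signature quoted above; for workers > 1 a thread-safe generator or a keras.utils.Sequence is the safer input):

model.fit_generator(g,
                    steps_per_epoch=no_examples // 10,
                    epochs=15,
                    workers=4,          # threads pre-fetching batches from the generator
                    max_queue_size=10)  # how many prepared batches to keep queued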

Thomas Smyth - Treliant
Lee James
0

Hope I am not late to the party. The most important thing I would add:

In Keras, using fit() is fine for smaller datasets that can be loaded into memory. For most practical use cases, however, datasets are too large to be loaded into memory at once.

For larger datasets we have to use fit_generator().

prosti
  • 42,291
  • 14
  • 186
  • 151
  • 1
    If you don't mind me saying, the question is not about when to use `fit()` or `fit_generator()`, which everybody agrees on, but about why they behave differently. –  Jul 01 '19 at 09:40
0

Make sure that your generator actually returns different batches each time. I ran into this issue with my generator. If you initialize your batch numpy placeholders before the while loop, you may be overwriting the same arrays inside the for loop, so everything the generator hands out refers to one shared buffer and the batches can end up identical. My issue was exactly that: I had a similarly structured generator, but I was returning the batches after the for loop: Why is this python generator returning the same value everytime?
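
As a rough sketch (my own) of the second fix applied to the question's generator: yield copies so that later iterations cannot overwrite batches that were already handed out (alternatively, allocate the two arrays inside the while loop):

import numpy

def data_gen_copy(all_inputs, all_targets, batch_size=10):
    input_batch = numpy.zeros((batch_size, no_features))
    target_batch = numpy.zeros((batch_size, 1))
    while True:
        for example_index, (each_input, each_target) in enumerate(zip(all_inputs, all_targets)):
            wrapped = example_index % batch_size
            input_batch[wrapped] = each_input
            target_batch[wrapped] = each_target
            if wrapped == batch_size - 1:
                yield input_batch.copy(), target_batch.copy()  # hand out copies, not the shared buffers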

You can check whether your generator works by using this snippet, which checks whether all the generated batches are indeed different:

import numpy as np

g = data_gen(input_data, target_data, batch_size=10)
input_list = []
target_list = []
for _ in range(100):
    input_batch, target_batch = next(g)
    input_list.append(input_batch)
    target_list.append(target_batch)
inputs = np.concatenate(input_list, axis=0)
targets = np.concatenate(target_list, axis=0)

all_different = True
for i in range(1, inputs.shape[0]):
    if np.array_equal(inputs[0], inputs[i]):
        all_different = False

if all_different:
    print('All batches different')
else:
    print('Generator broken. Initialize your numpy arrays inside the while loop '
          'or yield input_batch.copy(), target_batch.copy()')
Sten