I'm working on a project using Keras which has a large amount of input data, and a smaller amount of output/label data (which are images). The mapping of input->output data is contiguous and consistent, i.e. the first 1000 input samples correspond to the first image, the second 1000 input samples correspond to the second image and so forth.

Since the output data are images, having thousands of unnecessary copies of the same image in a numpy array is off the table, as it would require an enormous amount of memory. I was looking for a way of having "soft" links in the numpy array, such that indexing simply maps to a smaller array; however, I could not find an acceptable way of doing this.
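
For concreteness, the closest built-in thing is numpy's broadcast_to, which can fake this kind of view (zero extra memory, read-only), though as far as I can tell Keras materialises whatever it pulls from the array, so it doesn't actually solve the problem:

import numpy as np

# Hypothetical numbers: 100 frames, 1000 audio samples per frame.
images = np.random.rand(100, 64, 64, 3).astype('float32')

# Read-only view that "repeats" each image 1000 times without copying:
# the inserted axis has stride 0, so no image data is duplicated.
view = np.broadcast_to(images[:, np.newaxis], (100, 1000, 64, 64, 3))
print(view.shape)  # (100, 1000, 64, 64, 3)

# The catch: Keras wants a flat (n_samples, 64, 64, 3) array, and
# flattening a stride-0 view forces a real copy of everything:
# flat = view.reshape(-1, 64, 64, 3)  # would allocate the full array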

EDIT: I should add a bit more info here as I probably didn't explain the situation properly above.

The project I'm working on takes a video, splits it into its audio and video streams, uses the audio as input, and uses the individual video frames as output. In its "rawest" form, the net takes a single input (one audio sample) and passes it through a set of convolutional layers to form the output.

Of course, the number of input points available (48,000 samples per second for 48kHz audio) greatly outnumbers the number of output points (~24 fps). The immediate, simple option (and the one I'd take if my output data were smaller) would be to just replicate the data in the array and pony up for the extra RAM usage. Unfortunately this is not an option, as it would mean growing the array by a factor of about 2,000, which for an already large dataset would trigger an OOM error pretty fast.

Hopefully that's a better explanation of the situation I'm in. So far, one option I've considered/attempted is to overload some functions of the numpy array class, such as `__getitem__`, with the intention of just mapping indices to a smaller array. I abandoned this because I'm fairly sure the Keras backend just takes a contiguous block from numpy and uses that. Another option I've considered is to work with much smaller batches, replicating the images only as much as needed, training, and moving on to the next set of images. This is messy though (and feels like quitting).
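
That said, the index-mapping idea can be done cleanly at the batch level with a keras.utils.Sequence, so the full label array never needs to exist in memory. A rough sketch (names, shapes, and the samples-per-frame figure are all made up):

import numpy as np
from keras.utils import Sequence

class AudioFrameSequence(Sequence):
    """Yields (audio, image) batches, mapping each audio sample index
    to its frame via integer division. All names/shapes hypothetical."""

    def __init__(self, audio, images, samples_per_frame=2000, batch_size=128):
        self.audio = audio          # (n_audio_samples, ...)
        self.images = images        # (n_frames, height, width, 3)
        self.samples_per_frame = samples_per_frame
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.audio) / float(self.batch_size)))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = min(lo + self.batch_size, len(self.audio))
        x = self.audio[lo:hi]
        # Each audio sample i belongs to frame i // samples_per_frame,
        # so only batch_size images are ever materialised at once.
        y = self.images[np.arange(lo, hi) // self.samples_per_frame]
        return x, y

# model.fit_generator(AudioFrameSequence(audio, images), epochs=10)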

I think the best option, and the one I'll try next, is ldavid's suggestion of Keras's TimeDistributed wrapper. If I understand it correctly, I can use it to "batch" the input samples down into a set of samples the same size as the output data.

platinum95
  • Probably related: https://stackoverflow.com/q/29773918/1531971 – jdv Mar 22 '18 at 19:17
  • @jdv good ref, but I'm unsure if this applies if you are running on a GPU. – ldavid Mar 22 '18 at 21:34
  • Using raw PCM as input is unlikely to be useful. You may like to build a spectrogram from your input audio data first. That will also reduce its frequency. – Maxim Egorushkin Mar 23 '18 at 12:23

1 Answer

I believe this can be achieved with TimeDistributed and averaging the results.

There's a lot of missing information in your question, but I will assume your input's shape is (batch_size, 224, 224, 3) and your output shape is (batch_size, 7, 7, 512) in order to illustrate how this can be done.

Let's say you have a model (VGG16, say) that maps one input to one output:

from keras import Input, Model, backend as K
from keras.applications import VGG16
from keras.layers import TimeDistributed, Lambda

input_shape = (224, 224, 3)

vgg16 = VGG16(input_shape=input_shape,
              include_top=False,
              weights=None)

You can apply this model to every one of the 1000 images and combine the outputs like so:

x = Input(shape=(1000, 224, 224, 3))
y = TimeDistributed(vgg16)(x)                         # shape: (None, 1000, 7, 7, 512)
y = Lambda(lambda inputs: K.mean(inputs, axis=1))(y)  # shape: (None, 7, 7, 512)

model = Model(x, y)
model.compile(loss='mse', optimizer='adam')

Because the 1,000 per-image outputs are averaged, this model should also work when you time-distribute over a different number of samples (e.g. input_shape=(batch_size, 30, 224, 224, 3)).
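
For instance, leaving the time dimension as None should accept a variable number of images per sample (a sketch of the same model, untested here):

x = Input(shape=(None, 224, 224, 3))  # None: any number of images per sample
y = TimeDistributed(vgg16)(x)                         # shape: (None, None, 7, 7, 512)
y = Lambda(lambda inputs: K.mean(inputs, axis=1))(y)  # shape: (None, 7, 7, 512)

variable_model = Model(x, y)

Within a single batch every sample still needs the same number of images, but different batches may differ.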


A working example using MNIST, 10 input images for each label:

import numpy as np
from keras import Input, Model, backend as K
from keras.datasets import mnist
from keras.layers import TimeDistributed, Lambda, Conv2D

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train, x_test = (np.repeat(x.reshape(x.shape[0], 1, 28, 28, 1), 10, axis=1)
                   for x in (x_train, x_test))
# samples in x_train are repeated 10 times, shape=(60000, 10, 28, 28, 1)

y_train, y_test = (np.tile(y.reshape(y.shape[0], 1, 1, 1), (1, 3, 3, 16))
                   for y in (y_train, y_test))
# samples in y_train are repeated (3, 3, 16) times, shape=(60000, 3, 3, 16)

x = Input((28, 28, 1))
y = Conv2D(16, 3, strides=9, activation='relu')(x)
base_model = Model(x, y)

x = Input(shape=(10, 28, 28, 1))
y = TimeDistributed(base_model)(x)
y = Lambda(lambda inputs: K.mean(inputs, axis=1))(y)

model = Model(x, y)
model.compile(loss='mse',
              optimizer='adam')

print('initial train loss:', model.evaluate(x_train, y_train, verbose=2))
print('initial test loss:', model.evaluate(x_test, y_test, verbose=2))

model.fit(x_train, y_train,
          batch_size=1024,
          epochs=10,
          verbose=2)

print('final train loss:', model.evaluate(x_train, y_train, verbose=2))
print('final test loss:', model.evaluate(x_test, y_test, verbose=2))

The script prints:

initial train loss: 891.6627651529948
initial test loss: 931.27085390625
Epoch 1/10
 - 2s - loss: 383.4519
...
Epoch 10/10
 - 2s - loss: 27.5036
final train loss: 27.394255329386393
final test loss: 27.324540267944336
ldavid
  • I think this is what I need, and sorry for the lack of clarity in the original question! I've added an edit up there which hopefully explains the situation better. I'll try this suggestion once I have time and will report back here on the results. Thanks! – platinum95 Mar 23 '18 at 09:33
  • If your input data is audio, then time is also an important factor. Although there's a lot of successful work applying conv layers to time series, you might also want to consider a many-to-one `LSTM` or `GRU`. – ldavid Mar 23 '18 at 12:14