
Classification problem: Data is split into two folders. The CSVs include only the data. The code for my example model is:

model = Sequential()
model.add(CuDNNLSTM(3, input_shape=(None, 3), return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(1, activation='softmax'))

Question 1: Is there an alternative to the Keras generator that gives me control over which files are loaded?
Question 2: Is there any way to make variable timesteps possible other than a batch size of 1?
Question 3: Would this be the correct code for the LSTM to accept a variable timestep length? If not, please suggest a better way.

input_shape=(None, 3)
devnull

1 Answer


Questions 1 and 2

Case 1, your data fits in your memory

Just load the data into arrays and pad the data:

import pandas as pd
import numpy as np
import os
from keras.preprocessing.sequence import pad_sequences

#your class folders - choose the correct names
folder0 = "class0"
folder1 = "class1"

#x and y initially as lists
fileContents = []
fileClasses = []

#list of files in each dir
files0 = os.listdir(folder0)
files1 = os.listdir(folder1)

#load data for class 0
for f in files0:
    if f.endswith('.csv'):
        path = os.path.join(folder0, f)
        #header=None because the CSVs contain only data; remove it if your files have a header row
        frame = pd.read_csv(path, header=None)
        fileContents.append(frame.values)
        fileClasses.append(0) #append the correct class
        print(frame.values)

#load data for class 1
for f in files1:
    if f.endswith('.csv'):
        path = os.path.join(folder1, f)
        frame = pd.read_csv(path, header=None)
        fileContents.append(frame.values)
        fileClasses.append(1) #append the correct class
        print(frame.values)

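#pad the sequences so they all have the same length and transform into numpy
#choose best value for you, I chose 0 for example
#dtype must be set explicitly: pad_sequences defaults to int32 and would truncate float data
paddedSequences = pad_sequences(fileContents, padding='post', value=0, dtype='float32')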

x_train = np.array(paddedSequences)
y_train = np.array(fileClasses)

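Later, you will need to use a Masking(0) layer as the first layer of your model so that the padded timesteps are ignored. Note that the CuDNNLSTM layer from the question does not support masking; use the regular LSTM layer after Masking.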
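To see which timesteps a Masking(0) layer would skip, its documented rule (a timestep is masked only when every feature at that timestep equals the mask value) can be reproduced in plain NumPy. This is only an illustrative sketch with made-up numbers, not Keras code; the array contents and variable names are assumptions.

```python
import numpy as np

# Two post-padded sequences with 3 features per timestep (illustrative values)
padded = np.array([
    [[1.0, 2.0, 3.0], [4.0, 0.0, 6.0], [0.0, 0.0, 0.0]],  # real length 2
    [[7.0, 8.0, 9.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # real length 1
], dtype="float32")

mask_value = 0.0

# True = timestep is kept, False = timestep is skipped.
# A row with a single zero feature, such as [4, 0, 6], is still kept.
mask = np.any(padded != mask_value, axis=-1)

# Number of real (unmasked) timesteps per sequence
lengths = mask.sum(axis=1)

print(mask)
print(lengths)
```

If genuine rows in your data can be all zeros, they would be masked as well, so pick a padding value that cannot occur in the data and pass the same value to both pad_sequences and Masking.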

Case 2, your data doesn't fit in your memory

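Create a Python generator or a keras.utils.Sequence to use with model.fit_generator(). (In recent TensorFlow/Keras versions, fit_generator is deprecated and model.fit() accepts generators and Sequence objects directly.)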

The principle of loading the data is exactly the same as in case 1, but you will do it in smaller batches.

This is also a good opportunity to group the files by length and output batches of similar length, which reduces wasted padding.

There are plenty of answers and tutorials explaining how to create both of the options. For instance, Keras documentation teaches Sequence: https://keras.io/utils/
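The batching idea from this case can be sketched as a plain Python generator. This is a minimal sketch under stated assumptions, not a definitive implementation: the names pad_batch, batch_generator, lengths and load_fn are hypothetical, the row count of each file is assumed to be known in advance (for example, collected once by counting lines), and the default loader assumes headerless, purely numeric CSVs.

```python
import numpy as np

def pad_batch(sequences, value=0.0):
    """Post-pad a list of (timesteps, features) arrays to the longest one."""
    max_len = max(len(s) for s in sequences)
    n_features = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, n_features), value, dtype="float32")
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch

def batch_generator(paths, labels, lengths, batch_size=32, load_fn=None):
    """Yield (x, y) batches forever, grouping files of similar length.

    paths, labels and lengths are parallel lists; lengths holds the number
    of rows in each file, so sorting does not require loading the data.
    """
    if load_fn is None:
        # Assumption: headerless, purely numeric CSV files
        load_fn = lambda p: np.loadtxt(p, delimiter=",", ndmin=2)

    # Sort by length so that neighbouring files need little padding
    order = np.argsort(lengths)
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

    while True:  # Keras expects the generator to loop indefinitely
        # Shuffle the order of the batches, not their contents
        for b in np.random.permutation(len(batches)):
            idx = batches[b]
            x = pad_batch([load_fn(paths[i]) for i in idx])
            y = np.array([labels[i] for i in idx])
            yield x, y
```

Only the files of the current batch are in memory at any time, and each batch is padded only to the longest sequence it contains. When training from such a generator, the number of steps per epoch is the number of batches, i.e. the file count divided by batch_size, rounded up.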

Question 3

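Perfectly correct: input_shape=(None, 3) lets the LSTM accept any number of timesteps with 3 features each. Separately from the timestep question, note that Dense(1, activation='softmax') always outputs 1.0, because softmax normalizes over a single unit; for a two-class problem use activation='sigmoid' instead.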

Daniel Möller
  • Thank you for your answer! I appreciate your detailed explanation. The code and your tips about batch-size, generators etc. made the input into Keras clear for me. – devnull Oct 26 '19 at 17:03