Questions 1 and 2
Case 1, your data fits in your memory
Just load the data into arrays and pad the data:
import os
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences

#your class folders - choose the correct names
folder0 = "class0"
folder1 = "class1"

#x and y initially as lists
fileContents = []
fileClasses = []

#list of files in each dir (sorted, so the order is reproducible)
files0 = sorted(os.listdir(folder0))
files1 = sorted(os.listdir(folder1))

#load data for class 0
for f in files0:
    if f.endswith('.csv'):
        frame = pd.read_csv(os.path.join(folder0, f)) #use header=None if you don't have headers in the files
        fileContents.append(frame.values)
        fileClasses.append(0) #append the correct class
        print(frame.values)

#load data for class 1
for f in files1:
    if f.endswith('.csv'):
        frame = pd.read_csv(os.path.join(folder1, f)) #use header=None here too if needed
        fileContents.append(frame.values)
        fileClasses.append(1) #append the correct class
        print(frame.values)

#pad the sequences so they all have the same length and transform into numpy
#choose best value for you, I chose 0 for example
#dtype='float32' matters: pad_sequences defaults to int32 and would truncate float values
paddedSequences = pad_sequences(fileContents, padding='post', value=0, dtype='float32')
x_train = np.array(paddedSequences)
y_train = np.array(fileClasses)
Later, you will need a Masking(0) layer in your model so that it ignores the 0 values you used for padding.
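As a rough sketch of where that Masking layer goes, here is a small model built with the functional API. The number of features, the LSTM layer and its size are my assumptions for illustration, not something taken from your data. Note that Masking skips a timestep only when all of its features equal the mask value, so a real row made entirely of zeros would be skipped too.

```python
import numpy as np
from keras.layers import Input, Masking, LSTM, Dense
from keras.models import Model

n_features = 3  # assumption: number of columns in each CSV file

# None in the shape means "any number of timesteps"
inputs = Input(shape=(None, n_features))
# timesteps whose features are all 0 are flagged and skipped downstream
masked = Masking(mask_value=0.0)(inputs)
# placeholder recurrent layer; choose the type and size that suit your problem
encoded = LSTM(16)(masked)
# one sigmoid unit for the two classes (0 and 1)
outputs = Dense(1, activation="sigmoid")(encoded)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

With the mask in place, the amount of trailing padding should not change the prediction for a given sequence, which is the whole point of using it.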
Case 2, your data doesn't fit in your memory
Create a Python generator or a keras.utils.Sequence to use with model.fit_generator(). In recent Keras versions, fit_generator() is deprecated and model.fit() accepts both directly.
The principle of loading the data is exactly the same as in case 1, but you will do it in smaller batches.
This is also a good opportunity to group the files by length and output batches of similar length, which means less wasted padding.
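A minimal sketch of such a generator, using only numpy, could look like the following. All the names here (pad_batch, length_bucketed_batches, load_fn, paths, labels, lengths) are hypothetical: load_fn(path) is assumed to return one (timesteps, features) array, for example a wrapper around pd.read_csv(path).values, and lengths[i] is assumed to be the row count of paths[i], collected once beforehand.

```python
import numpy as np

def pad_batch(sequences, value=0.0):
    """Post-pad (timesteps, features) arrays to the longest one in this batch."""
    max_len = max(len(s) for s in sequences)
    n_features = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, n_features), value, dtype="float32")
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s  # real data first, padding stays at the end
    return batch

def length_bucketed_batches(load_fn, paths, labels, lengths, batch_size):
    """Yield (x, y) batches forever, grouping files of similar length."""
    order = np.argsort(lengths, kind="stable")  # indices from shortest to longest
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    while True:  # generators used for Keras training loop indefinitely
        # shuffle the order of the batches, not their contents
        for k in np.random.permutation(len(batches)):
            idx = batches[k]
            x = pad_batch([load_fn(paths[i]) for i in idx])  # only this batch is in memory
            y = np.array([labels[i] for i in idx])
            yield x, y
```

Each batch is padded only to its own longest file, so different batches may have different lengths. That is fine as long as the model's input shape leaves the timestep dimension as None. When training, you would also pass steps_per_epoch equal to the number of batches, since the generator never ends.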
There are plenty of answers and tutorials explaining how to create either option. For instance, the Keras documentation explains Sequence: https://keras.io/utils/
Question 3
Perfectly correct.