4

The datasets I am working with correspond to individual time-series signals. Each signal is unique, with a differing number of data points, though every signal represents the same semantic data (speed in mph).

I am working with Keras, and trying to fit a basic neural network to the data just to evaluate it. Below is the Python code for that:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Essentially, I am fitting the model to each dataset as follows:

import os

for file in os.listdir(directory):
    data = pd.read_csv(os.path.join(directory, file))
    # get X_train and y_train ...
    model.fit(X_train, y_train, epochs=10)

Is this a valid way to train a model on multiple datasets of the same semantic data?

Laurent Bristiel

1 Answer


Yes. You can either create the model once and call fit() in a loop over the datasets, or stack all the data into a single matrix in a loop and then call fit() once. With the first approach you call fit() n times on smaller chunks of data; with the second you call fit() only once on one big data matrix.

However, the first approach is preferable, since assembling all of the data into one matrix can be a problem (for example, it may not fit in memory). So go ahead with your current implementation.
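For completeness, here is a minimal sketch of the second option (stack everything first, then make a single fit() call). The toy dataframes and the assumption that the label lives in the last column are hypothetical; adapt to however your CSVs are laid out:

```python
import numpy as np
import pandas as pd

def stack_datasets(frames):
    """Stack per-signal dataframes into one (X, y) pair.
    Assumes (hypothetically) that the last column is the label."""
    X_parts = [df.iloc[:, :-1].to_numpy() for df in frames]
    y_parts = [df.iloc[:, -1].to_numpy() for df in frames]
    return np.vstack(X_parts), np.concatenate(y_parts)

# Two toy "signals" of different lengths but identical columns.
a = pd.DataFrame({"speed": [10.0, 20.0, 30.0], "label": [0, 1, 1]})
b = pd.DataFrame({"speed": [5.0, 15.0], "label": [0, 0]})

X_all, y_all = stack_datasets([a, b])
# X_all now has 5 rows (3 + 2); a single model.fit(X_all, y_all, epochs=10)
# would then train on every signal at once.
```

The trade-off is exactly the one above: one large matrix in memory versus many small fit() calls.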

Lawhatre
  • Will the accuracy vary for each dataset? For example, after 10 epochs on the first dataset it may reach an accuracy of ~94%, then on the second dataset the accuracy starts at ~65%. Is that normal? –  Nov 02 '20 at 04:18
  • It's OK to get 65% after 94%. Your first dataset may not cover the entire vector space of the data; adding the second dataset expands that space. Since the model was trained only on the first set, which covers a limited feature space, it initially performs badly on the second. Once it has been trained on both sets, you obtain the desired model, which will be more robust and accurate. – Lawhatre Nov 02 '20 at 04:46
  • To test this, hold out some points from each dataset and train only on the first dataset, then evaluate all the held-out points. You will notice that the points from the first set perform well while the others don't. – Lawhatre Nov 02 '20 at 04:49
  • Then continue by training on the second dataset. After that, the model will perform better on both sets 1 and 2, but not so well on set 3. Continue training until all the sets are exhausted. Once you do, and you test points from each dataset, you will notice that the model performs well on all of them. – Lawhatre Nov 02 '20 at 04:51
  • Great. Is there a way to evaluate the overall accuracy of the model? I know model.evaluate takes only one dataset, but would I essentially need to average the model.evaluate results across all the datasets I am training on? –  Nov 03 '20 at 02:40
  • You can perform holdout cross-validation. For each dataset, keep a separate test set. Then train your model in a loop over the datasets. Finally, evaluate all the test points and take the average of whichever performance metric you want. – Lawhatre Nov 03 '20 at 06:14
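The holdout scheme described in the comments can be sketched roughly like this. The split helper and the toy (X, y) pairs are hypothetical stand-ins; in practice each pair would come from one of your CSV files, and the commented-out lines show where the Keras calls would go:

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_split(X, y, test_frac=0.2):
    """Shuffle the rows and hold out test_frac of them as a test set."""
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy stand-ins for per-file (X, y) pairs of different lengths.
datasets = [
    (rng.random((30, 1)), rng.integers(0, 2, 30)),
    (rng.random((20, 1)), rng.integers(0, 2, 20)),
]

test_sets = []
for X, y in datasets:
    X_tr, y_tr, X_te, y_te = holdout_split(X, y)
    # model.fit(X_tr, y_tr, epochs=10)   # train in the loop, as in the question
    test_sets.append((X_te, y_te))

# After training on every dataset, evaluate each held-out set and average:
# scores = [model.evaluate(X_te, y_te, verbose=0)[1] for X_te, y_te in test_sets]
# overall_accuracy = float(np.mean(scores))
```

This keeps one untouched test split per dataset, so the final averaged metric reflects performance across all of the signals rather than just the last one trained on.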