
I have a dataset with 6000 records, split into train, validation and test sets in a 60-20-20 ratio. I get an accuracy of around 76% with XGBoost. When I convert my data into a time series and apply an LSTM/1-D ConvNet, the accuracy drops to around 60%. Is my dataset too small for deep learning?

Secondly, can I apply SMOTE to each of the train, validation and test sets separately (i.e. after splitting the data)? I know SMOTE should not be applied before splitting the data into train/test/validate sets. Is it okay to upsample the train/validation/test sets after splitting them?

If I upsample the train/validation/test sets after splitting them, I get better results with the LSTM (around 80%). But is this approach right? I just want to show that with more data, we can improve the accuracy of the deep learning algorithm.
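Roughly, this is what I mean (a sketch with imbalanced-learn's `SMOTE`; the dummy `X` and `y` just stand in for my actual features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# dummy stand-ins for my ~6000 records (imbalanced binary labels)
X = np.random.rand(6000, 10)
y = np.random.choice([0, 1], size=6000, p=[0.8, 0.2])

# 60-20-20 split first
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# then SMOTE applied separately to each split, with a new SMOTE object each time
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
X_val, y_val = SMOTE(random_state=42).fit_resample(X_val, y_val)
X_test, y_test = SMOTE(random_state=42).fit_resample(X_test, y_test)
```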

  • Not a *programming* (or `python`) question, hence arguably off-topic here; better suited for [Cross Validated](https://stats.stackexchange.com/help/on-topic). – desertnaut Sep 04 '19 at 15:18

1 Answer


In general, SMOTE should only be applied to the training set; you tune hyperparameters on the validation set and leave the test set alone.
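A minimal sketch of that workflow (the toy data and the XGBoost settings are only placeholders; the point is that only the training split is resampled):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# toy imbalanced data standing in for the ~6000 records in the question
X, y = make_classification(n_samples=6000, weights=[0.8, 0.2], random_state=0)

# 60-20-20 split before any resampling
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# oversample the training set only; validation and test keep the true class distribution
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = XGBClassifier(n_estimators=200)
clf.fit(X_train_res, y_train_res)

print("validation accuracy:", clf.score(X_val, y_val))  # use for tuning
print("test accuracy:", clf.score(X_test, y_test))      # report once at the end
```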

In your case, I am not sure how you apply SMOTE to the time series data. There must be some assumptions behind that, and they may influence your result.

Shenan
  • But what if we apply SMOTE separately to the training, validation and test sets? I am creating a new SMOTE object every time; I am not applying SMOTE and then splitting, so technically no information can leak from the training set to the test set. And I applied SMOTE to the regular data and then converted it to a time series. – Usman Malik Sep 03 '19 at 18:08
  • SMOTE is an approximation of the real target. By keeping only the true target in the validation and test datasets, you get the most accurate validation and test results. I am still concerned about the oversampling and the conversion to a time series. Can you elaborate on how you convert the regular data to a time series? I think the key concept in a time series is the cross-time correlation, and plain SMOTE cannot capture that. – Shenan Sep 03 '19 at 18:17
  • Basically, I have a meeting dataset where multiple people interact with each other, and I have to predict who the next speaker will be. Since the previous series of dialogues can have an effect on the next speaker, I am using data from the previous dialogues, such as the previous speaker, previous dialogue, etc., to construct a time series (a rough sketch of this windowing is shown after the comments). – Usman Malik Sep 03 '19 at 18:43
  • I think if you put all the previous dialogues and contexts in as features and then run SMOTE, it could work. From a statistical perspective this will cause problems, because the oversampling method neglects the covariance. But I am not an expert on deep learning methods, and so far I think it won't cause problems for prediction. – Shenan Sep 03 '19 at 19:06
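A rough sketch of the windowing described in the comments (the function name, window size and feature layout are all hypothetical, not the asker's actual code):

```python
import numpy as np

# turn a meeting transcript into (window of previous utterances, next speaker) pairs
# for an LSTM. dialogue_features[i] is a feature vector for utterance i (e.g. encoded
# previous speaker, dialogue features); speakers[i] is the id of whoever spoke utterance i.
def make_sequences(dialogue_features, speakers, window=5):
    X_seq, y_next = [], []
    for i in range(window, len(dialogue_features)):
        X_seq.append(dialogue_features[i - window:i])  # the previous `window` utterances
        y_next.append(speakers[i])                     # the speaker to predict
    return np.array(X_seq), np.array(y_next)

# dummy data: 100 utterances, 8 features each, 4 possible speakers
feats = np.random.rand(100, 8)
spk = np.random.randint(0, 4, size=100)
X_seq, y_next = make_sequences(feats, spk)
print(X_seq.shape, y_next.shape)  # (95, 5, 8) (95,)
```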