I am new to time series classification and am having trouble understanding how my training set should be constructed. My current data look like this:
Timestamp   User ID  Feature 1  Feature 2  ...  Feature N  target
2002-10-30  1        0          0          ...  1          0
2002-10-31  2        0          1          ...  1          0
...
2017-10-30  1        0          0          ...  0          1
2017-10-31  2        0          1          ...  0          0
The features are one-hot encoded text features, recorded at time t for a given User ID. The target is an event occurring / not occurring at time t. I would like to detect this event given a new set of features for all the User IDs of the dataset, at a new given time t.
I understood from this paper that one way to model this is by using a "sliding windows classifier". For any time t, I could aggregate together the features from t, t-1, ..., t-n and set a more flexible target that would be "the event occurred or not at either t, t+1, ..., t+n". Is this the correct way to build such a classifier?
I am also considering more recent approaches like "recurrent neural network architectures (LSTM)". How could I build a training set to feed this model from the dataset above?
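To make both questions concrete, here is a minimal sketch of what I have in mind, not a tested solution. It assumes the rows are already sorted by timestamp within each user and that the data have been loaded into NumPy arrays. The names `make_windows`, `n_lags` and `horizon` and the toy data are my own placeholders, not from the paper. As far as I understand, Keras recurrent layers expect input of shape (samples, timesteps, features), while a scikit-learn classifier expects a 2D array, so the same windows would be used flattened or unflattened.

```python
import numpy as np

def make_windows(user_ids, X, y, n_lags=2, horizon=0):
    """Build per-user sliding windows.

    Assumes rows are sorted by timestamp within each user.
    Each sample holds the features at t-n_lags, ..., t and is labelled 1
    if the event occurs at any of t, t+1, ..., t+horizon.
    Returns an array of shape (samples, n_lags + 1, n_features) and labels.
    """
    seqs, labels = [], []
    for uid in np.unique(user_ids):
        idx = np.flatnonzero(user_ids == uid)   # this user's rows, in file order
        Xu, yu = X[idx], y[idx]
        # Stop `horizon` rows before the end so every label window is complete.
        for t in range(n_lags, len(idx) - horizon):
            seqs.append(Xu[t - n_lags:t + 1])                 # features t-n_lags .. t
            labels.append(int(yu[t:t + horizon + 1].any()))   # event in t .. t+horizon
    return np.stack(seqs), np.array(labels)

# Toy data (made up): 2 users, 4 time steps each, 3 binary features.
rng = np.random.default_rng(0)
user_ids = np.array([1, 2, 1, 2, 1, 2, 1, 2])
X = rng.integers(0, 2, size=(8, 3))
y = np.array([0, 0, 0, 0, 0, 0, 1, 0])

# 3D layout for a recurrent model: (samples, timesteps, features)
X_seq, y_seq = make_windows(user_ids, X, y, n_lags=1, horizon=1)

# 2D layout for a scikit-learn classifier: one flat row per window
X_flat = X_seq.reshape(len(X_seq), -1)

print(X_seq.shape)   # (4, 2, 3)
print(X_flat.shape)  # (4, 6)
```

If this is right, I would pass `X_flat, y_seq` to the `fit` method of a scikit-learn classifier, and `X_seq, y_seq` to a Keras model whose first layer is an LSTM with `input_shape=(n_lags + 1, n_features)`. Is that the intended construction in both cases?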
PS: I plan to use scikit-learn / Keras to build the classifiers.
Thanks in advance for your time and answers.