1

I have a stream of data (e.g. 3D position) generating by a system which it looks like:

(pos1, time1) (Pos2, time2) (pos3, time3) ...

I want to use a machine learning technique to estimate the likelihood (or detect) of a particular event from given stream of data. What I have done:

  1. I've tagged my data at every frame by YES if the event occurred at that frame, otherwise it is set to NO.

(pos1, time1, NO) (Pos2, time2, Yes) (pos3, time3, NO) ...(posK, timeK, Yes)...

  1. set a window length like L to train model by giving L consecutive frames and the corresponding tag is set by the tag of the last element on that window:

(pos1, Pos2, pos3, NO) (pos2, Pos3, pos4, NO) (pos3, Pos4, pos5, NO) ... (posK-2, PosK-1, posK, YES) ...

  1. Finally, I trained my model by this set of that.
  2. For Testing, I concatenate L consecutive frames and ask the model to find the corresponding tag for this set of data (e.g. YES or NO).

I realize that occurrence of "NO" is a lot more frequent that "YES". Simply because the system is mostly on idle state and I have no event. So it affects on the training.

Could you give me some hints: 1) what type of machine learning model is the best fit for this problem. 2) At the moment I am classifying the output either "YES" or "NO" but I would like to have the probability of occurrence of the event at anytime. What kind of model is do you suggest?

Thanks

user1021110
  • 213
  • 1
  • 6
  • 13

1 Answers1

1

I think there are actually two questions, here: how to build the dataset, and which predictor to use.

For building the dataset, at some time point i, make sure to choose the ℓ instances happening before i (the phrasing in your question made it seem that you're choosing the one including i). The label of the outcome should be the one at i, though. After all, you're attempting to predict the future based on the present, no? Predicting the present based on the present is rather easy.

Another point is how to choose ℓ, or even whether to choose a single ℓ. Note that if you choose a number of different values of ℓ, then you get a multivariate model.

Finally, the question you directly asked is which predictor to use. This is too wide to answer without knowing your dataset (and playing with it). You might want to read about the bias-variance tradeoff to see why there is no "best" predictor for some problem.

Having said that, I'd suggest that you start with logistic regression which is a simple and robust classifier that also outputs probabilities (as you asked).

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • You mentioned a very good point about the dataset. I will try multivariate logistic regression and follow it up over here. Thank you again! – user1021110 Mar 27 '16 at 01:30