
I am trying to find anomalies in my time-series data, which has 18 features. Since Fb-Prophet doesn't support predicting multivariate time series, I was wondering if using PCA for dimensionality reduction would be a good idea. I have 2 years of data sampled every 15 minutes.

There is a lot of missing data at random timestamps for each of the features. My data is partially labeled. I understand PCA might remove the anomalies. Is there an alternative technique I could follow?

1 Answer


I will try to answer by separating your question into 3 parts, as I think you are conflating three different concepts, or at least it was not clear to me what you are actually trying to model.

1) Facebook's Prophet model's main goal is to make predictions by modeling a univariate time series as a function of time. The main model equation has the form

y(t) = g(t) + s(t) + h(t) + error_t

where g takes care of the trend, s of the seasonality, and h of holiday effects.

As such, I would not say that the Prophet model is in any way suited to detecting outliers in your time series. Prophet can be useful for predicting y, with or without the presence of outliers (I assume this is what you mean by anomalies?) in your data, but I don't see it being useful for outlier detection.
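To make the additive decomposition above concrete, here is a tiny synthetic sketch (numpy only; the trend slope, seasonal amplitude, and noise level are invented purely for illustration, and the holiday term is set to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2 * 365 * 96)              # two years of 15-minute steps (96 per day)

g = 0.001 * t                            # g(t): a slowly increasing linear trend
s = 2.0 * np.sin(2 * np.pi * t / 96)     # s(t): daily seasonality, period = 96 steps
h = np.zeros_like(t, dtype=float)        # h(t): holiday effect, zero for simplicity
error = rng.normal(0, 0.5, size=t.shape) # error_t: observation noise

y = g + s + h + error                    # the Prophet-style additive decomposition
print(y.shape)
```

Prophet estimates g, s, and h from the observed y; the point is that every component is a function of time alone, which is why the model is univariate.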

2) Regarding PCA on time series: to reduce dimensionality in time series (when it makes sense, e.g. when your features are highly correlated), some version of PCA could indeed be employed.

You might be able to get away with static PCA if you can realistically assume that your series are stationary (variance and mean do not change over time). Otherwise, take a look at dynamic PCA and/or dynamic factor analysis.
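A minimal sketch of static PCA with scikit-learn, on fake data shaped like yours (18 features, missing values filled with a simple mean imputer since PCA cannot handle NaNs; the 5% missingness rate and the 95% variance target are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 18))             # 1000 timestamps, 18 sensor features
X[rng.random(X.shape) < 0.05] = np.nan      # ~5% of values missing at random

X_filled = SimpleImputer(strategy="mean").fit_transform(X)  # PCA cannot handle NaNs
X_scaled = StandardScaler().fit_transform(X_filled)         # PCA is scale-sensitive

pca = PCA(n_components=0.95)  # keep the fewest components explaining 95% variance
Z = pca.fit_transform(X_scaled)
print(Z.shape, pca.explained_variance_ratio_.sum())
```

Note this treats each timestamp as an independent observation, which is exactly the stationarity assumption mentioned above; dynamic PCA relaxes it.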

Again, if by anomalies you mean outliers, I don't see how PCA would remove anomalies either.

3) Regarding detecting outliers in time series: I have recently seen isolation forest applied to detect anomalies (as in, outliers) in time series. It also works in a multidimensional feature space.

Maybe that class of models is better suited to your task.
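As a sketch of the idea, here is isolation forest flagging a handful of injected faults in an 18-dimensional feature space (the data, fault magnitude, and contamination rate are all made up for the example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
normal = rng.normal(0, 1, size=(990, 18))   # typical sensor readings
faulty = rng.normal(8, 1, size=(10, 18))    # injected faults, far from the bulk
X = np.vstack([normal, faulty])

iforest = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iforest.predict(X)                 # +1 = inlier, -1 = outlier
print((labels == -1).sum())
```

The `contamination` parameter encodes your prior belief about the fraction of faulty points; if you have no idea, the default `"auto"` threshold is a reasonable starting point.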


Hope that helps; with further information on your problem I might be able to say more.

Jean_N
  • 1. @Jean_N My data has values from the different sensors of a machine. By anomaly I mean faulty data points, which would be outliers or anything that is not normal sensor behavior. 2. I have already applied isolation forest and one-class SVM, but my biggest problem is that I don't have proper labels (for faulty points) to calculate my accuracy or verify them. – Shreya Rajput Sep 11 '19 at 17:56
  • 3. I have split my data into test and train. With Fb-Prophet, the approach I am taking is calculating the distance between predicted values and actual values; predictions with a large distance from the actual values are anomalous. But this approach requires me to train my model on clean data. – Shreya Rajput Sep 11 '19 at 18:05
  • OK, I see where you are going with this. But I find it very strange: there is no way Prophet would know what is an outlier and what is not, and hence these observations will influence the fitting itself, which might result in a poor model. In the end, if you cannot trust the estimated values for the mean, the distance to the observations (potential outliers) is not very trustworthy either. – Jean_N Sep 11 '19 at 18:26
  • Regarding isolation forest: if you have unlabeled data, you have unlabeled data and that is it; you cannot compute the accuracy without labels. IF will nevertheless estimate which points are far from the others in a multidimensional feature space, and it is reasonable to assume that those points are outliers. Lastly, you mentioned that *some* of your data is labeled. You could apply isolation forest on those points (splitting into train and test) and check the accuracy on the test set for those that are labeled as outliers. That would give you confidence to apply IF on the whole dataset. – Jean_N Sep 11 '19 at 18:28
  • Thank you. That sounds good; I will try that. Would you suggest using an LSTM auto-encoder? It has been suggested, but I think my data set is too huge for it: I have two years of data every 15 min. – Shreya Rajput Sep 11 '19 at 18:53
  • I am more inclined towards the Fb model as it takes care of the seasonality; my data set has a lot of seasonality. I am referring to this: https://towardsdatascience.com/anomaly-detection-time-series-4c661f6f165f – Shreya Rajput Sep 11 '19 at 18:59
  • @ShreyaRajput, I see, thank you for giving this link. I see now where this comes from. Part 1) I never looked at it from that angle and I still find it very weird. In its simplest possible form (no seasonalities or holiday effects, just the linear trend part), Prophet is a (Bayesian) linear model. At each step, it can give you the whole posterior distribution for y_hat, so that you can also calculate the probability of observing the actual y given that your model is true. I think this is the essence of the approach that you linked to. – Jean_N Sep 12 '19 at 09:05
  • Part 2) But this requires that you can trust your model... which you just fitted on the outlier (i.e. wrong) data! The fit itself (that you use to calculate a distance to) is affected by these outliers. In fact, if you look at Prophet's documentation here: https://facebook.github.io/prophet/docs/outliers.html, the suggestion is to remove the outliers. You can even "fit" the outliers (so that they won't be considered outliers by the model anymore) if you start tweaking the model and e.g. increase the number of trend changepoints (try it). – Jean_N Sep 12 '19 at 09:08
  • Part 3) Maybe a way to see it is to consider these data points as outliers **given** a fixed Prophet model that you do not touch (e.g. you take the default parameters or a fixed set of parameters). But you see my point, right: the outliers themselves influence the model fit (which is why they are undesired), so using the model fit to identify the outliers is a bit of a circular argument. – Jean_N Sep 12 '19 at 09:15
  • Part 4) If it is seasonality you are worried about, then maybe an idea would be to estimate a model with seasonality (you could do STL decomposition or even Prophet), subtract it from the data, and then apply something like IF on the rest of the data. That would be my approach. Regarding your LSTM auto-encoder question: I have no idea, I have never used them. Finally, if my original answer answered your main question, please mark it as answered. – Jean_N Sep 12 '19 at 09:17