
Dataset looks like:

  • all features x are numerical and scaled, except for name (which is currently part of the index alongside year)
[name, year, x1, x2, x3, x4, ...]

josh  2001  ... #the various values for the x_features, for that name, at that time
josh  2002  ...
josh  2003  ... 

bill  2001  ...
bill  2002  ...
bill  2003  ...

I have already applied StandardScaler to my entire time series dataset.

I am now about to use PCA, but I stopped to wonder whether it can/should be applied to an entire time series dataset like the one above.

  • I have just finished researching PCA quite heavily, but could not think of a reason why using it on a time series would be any different.
  • Am I forgetting something critical about PCA with respect to time series?

I found some older mentions of Functional PCA, but is this still relevant/needed? Or has scikit-learn made this obsolete?
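For reference, here is a minimal sketch of what "StandardScaler, then PCA on the whole dataset" does with data shaped like the above. The values are synthetic and the four feature columns are placeholders for the real ones (assumptions for illustration only). It shows that scikit-learn's `PCA` treats every (name, year) row as an independent observation: the index is not used, so shuffling the rows gives the same components up to sign.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the panel: 2 names x 3 years, 4 made-up features.
rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [["josh", "bill"], [2001, 2002, 2003]], names=["name", "year"]
)
df = pd.DataFrame(
    rng.normal(size=(6, 4)), index=index, columns=["x1", "x2", "x3", "x4"]
)

# Scale all rows together, then project onto two components.
# Only the feature columns are seen; the (name, year) index is ignored.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
scores = pd.DataFrame(
    pca.fit_transform(scaled), index=df.index, columns=["pc1", "pc2"]
)

# Refit on a row-shuffled copy: time order plays no role in the result.
shuffled = df.sample(frac=1.0, random_state=1)
pca_shuffled = PCA(n_components=2).fit(StandardScaler().fit_transform(shuffled))

print(scores)
print(pca.explained_variance_ratio_)
```

So nothing in plain PCA "knows" that josh 2001 and josh 2002 are related; whether that matters depends on what the components are used for.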

Alex
  • Actual Dataset has around 75 features – Alex Jan 03 '20 at 21:03
  • For search keyword purposes, it's more accurate to call these "panel data" than "time series data." Here's a related question from the Cross-Validated site: https://stats.stackexchange.com/questions/153873/principal-component-analysis-on-time-series-data-and-panel-data – Kevin Troy Jan 03 '20 at 21:07
  • @KevinTroy That is very helpful for my continued research! Thank you! – Alex Jan 03 '20 at 21:10
  • The link @KevinTroy provided discusses IID, which seems important. IIUC, that implies that you should be doing some differencing of your time series in order to make it stationary. HOWEVER!! What I wanted to contribute is that you really need to be aware of what you are doing. Are you using the results of the PCA to make inferences about the future? Are you expecting to make inferences over time (like a back test)? If you are, you need to make sure that you are **NOT** using data that wasn't available at the time of inference. – piRSquared Jan 03 '20 at 21:29
  • This also means that you **CAN'T** use `StandardScaler` across data that looks into the future relative to your inference. You should roll through each time that you are looking to make an inference and utilize the data that was available at that time. – piRSquared Jan 03 '20 at 21:29
  • @piRSquared Thank you for your feedback. To address your questions: I am not interested in future predictions, nor am I doing a back test. The end goal of my entire process is to rank each name via a letter grade or a number out of ten. Unfortunately, I do not have an existing target variable (the problem is unsupervised). Thus, I was planning on reducing the dimensionality and fitting some kind of unsupervised model that would eventually lead to a classification. Frankly, the latter portion of my process is a bit less thought out, as I am less experienced with unsupervised classification. – Alex Jan 04 '20 at 01:02
