I have multiple time series DataFrames, each representing a different asset.

The problem is that there are holes in the data (at timestamps where the other assets do have data).

Question: What are some good-quality ways to clean the data so that I can fill the missing rows with something close to reality?

extra information:

My first ideas:

  1. An LSTM that predicts the missing values (problem: I could only train it on row sequences without holes -> bias)

  2. ARIMA (no idea how it works, I have only heard of it)

  3. The mean of the values before and after the hole (-> unrealistic, and it misses outliers and spikes)

  4. What are better approaches? (Dropping rows is not an option.)

Here is some sample data:

(... which I just wrote by hand as an example. The prices are meaningless; they are only there to show the holes as NaN values.)

df1
                            Open         High          Low        Close
Time                                                          
2014-10-10 00:00:00      1.12345      1.12345      1.12345      1.12345
2014-10-13 00:00:00      1.12345      1.12345      1.12345      1.12345
2014-10-14 00:00:00      1.12345      1.12345      1.12345      1.12345
2014-10-15 00:00:00      1.12345      1.12345      1.12345      1.12345
2014-10-16 00:00:00      1.12345      1.12345      1.12345      1.12345
                      ...       ...  ...            ...            ...
2016-02-23 16:00:00      1.12345      1.12345      1.12345      1.12345
2016-02-23 17:00:00      1.12345      1.12345      1.12345      1.12345 
2016-02-23 18:00:00      1.12345      1.12345      1.12345      1.12345
2016-02-23 19:00:00          NaN          NaN          NaN          NaN
2016-02-23 20:00:00      1.12345      1.12345      1.12345      1.12345

df2
                                Open             High              Low            Close
Time                                                          
2014-10-10 00:00:00      28391.12345      28391.12352      28391.12332      28391.12347
2014-10-13 00:00:00      28391.12348      28391.12358      28391.12340      28391.12350
2014-10-14 00:00:00              NaN              NaN              NaN              NaN
2014-10-15 00:00:00      28391.12350      28391.12354      28391.12344      28391.12353
2014-10-16 00:00:00      28391.12350      28391.12354      28391.12344      28391.12353
                      ...       ...  ...            ...            ...
2016-02-23 16:00:00      28391.30000      28391.30000      28391.10000      28391.10000
2016-02-23 17:00:00      28391.10000      28391.50000      28391.09000      28391.40000
2016-02-23 18:00:00      28391.12345      28391.12345      28391.12345      28391.12345
2016-02-23 19:00:00      28391.12345      28391.12345      28391.12345      28391.12345
2016-02-23 20:00:00      28391.12345      28391.12345      28391.12345      28391.12345
Benoid

1 Answer

You have asked two questions here:

1) Data cleansing: First check whether any trades actually took place at the missing timestamps; the gaps could simply be holidays. Comparing against other assets may not work unless they use the same trading calendar and have similar liquidity. Keep in mind that not all financial markets trade from Monday to Friday.
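
As a rough illustration of that check, here is a minimal pandas sketch. The function name `describe_gaps`, the `reference` argument, and the report columns are made up for this example, and the data reuses the hand-written sample values from the question. The weekend flag is only a crude stand-in for a real trading calendar, which you would have to supply for your market.

```python
import numpy as np
import pandas as pd


def describe_gaps(df: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """List the all-NaN rows of df with hints on whether each gap is expected.

    `reference` is another asset's frame; it is only a fair comparison
    if both assets share a trading calendar (an assumption, see above).
    """
    all_nan = df.isna().all(axis=1).to_numpy()
    missing = df.index[all_nan]
    return pd.DataFrame(
        {
            "weekday": missing.day_name().to_numpy(),
            # Saturday/Sunday only; real holidays need a proper calendar.
            "is_weekend": np.asarray(missing.dayofweek) >= 5,
            # True if the other asset has at least one value at that timestamp.
            "reference_has_data": reference.reindex(missing)
            .notna()
            .any(axis=1)
            .to_numpy(),
        },
        index=missing,
    )


# Sample values taken from the question (df1 has data, df2 has one hole).
idx = pd.to_datetime(
    ["2014-10-10", "2014-10-13", "2014-10-14", "2014-10-15", "2014-10-16"]
)
df1 = pd.DataFrame({"Close": [1.12345] * 5}, index=idx)
df2 = pd.DataFrame(
    {"Close": [28391.12347, 28391.12350, np.nan, 28391.12353, 28391.12353]},
    index=idx,
)

report = describe_gaps(df2, reference=df1)
print(report)
```

A gap that falls on a weekday while the reference asset does have data is a candidate for a genuine data hole; a gap where neither asset has data is more likely a market closure and probably should not be filled at all.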

2) Best model: You need to do some R&D with a benchmark in mind to find out what works for you. A model that predicts the close well may perform badly when predicting the volume.
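
One possible way to set up such a benchmark, sketched under assumptions: hide some values you actually know, let each candidate method fill them back in, and compare the result with the truth. The helper `score_fill_methods`, the two candidates, and the synthetic random-walk prices are all hypothetical choices for illustration; an ARIMA or LSTM imputer could be plugged in as another callable with the same signature.

```python
import numpy as np
import pandas as pd


def score_fill_methods(series, methods, n_hidden=20, seed=0):
    """Hide known points, refill them with each method, return the mean absolute error.

    Needs at least three known values in `series`.
    """
    rng = np.random.default_rng(seed)
    known = np.flatnonzero(series.notna().to_numpy())
    # Never hide the first/last known point, so every hidden point
    # has a known neighbour on both sides.
    inner = known[1:-1]
    hidden = rng.choice(inner, size=min(n_hidden, len(inner)), replace=False)
    masked = series.copy()
    masked.iloc[hidden] = np.nan
    truth = series.iloc[hidden]
    return {
        name: float((fill(masked).iloc[hidden] - truth).abs().mean())
        for name, fill in methods.items()
    }


# Candidate fill methods; each takes a Series with NaNs and returns a filled one.
candidates = {
    "forward_fill": lambda s: s.ffill(),
    "time_interpolation": lambda s: s.interpolate(method="time"),
}

# Synthetic random-walk prices (made-up data, not from the question).
index = pd.date_range("2014-10-10", periods=200, freq="D")
steps = np.random.default_rng(42).normal(0.0, 1.0, size=len(index))
prices = pd.Series(100.0 + np.cumsum(steps), index=index)

scores = score_fill_methods(prices, candidates)
print(scores)
```

Run this separately for each column you care about (close, volume, and so on), since the ranking of methods can differ between them.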

Maged