
I have a dataset with an average interval of 22.xx seconds between recordings, while the median interval is 21 seconds.

I tried the DatetimeIndex.floor/ceil/round functions (with 20-, 21-, and 22-second frequencies), but these lead to duplicated indices. Since I subsequently need to merge the dataframe with another one, duplicated indices have to be avoided, and dropping the duplicated indices leads to significant data loss.

I want the dataframe to use an equidistant time index, preferably aligned to the 00/20/40-second marks.

A few sample rows of the data are shown below.

2018-05-06 18:02:24    1.15e+07
2018-05-06 18:02:45    1.35e+07
2018-05-06 18:03:05    1.08e+07
...
2018-05-06 18:08:30    1.11e+07
2018-05-06 18:08:50    1.20e+07
2018-05-06 18:09:10    1.30e+07
...

After processing, the data should look like the following.

2018-05-06 18:02:20    1.15e+07
2018-05-06 18:02:40    1.35e+07
2018-05-06 18:03:00    1.08e+07
...
2018-05-06 18:08:20    1.11e+07
2018-05-06 18:08:40    1.20e+07
2018-05-06 18:09:00    1.30e+07
...

Rounding causes the above to have duplicated indices, and floor and ceil do as well.
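
For reference, a minimal sketch reproducing the duplication with the sample rows above (the column name col is illustrative):

import pandas as pd

idx = pd.DatetimeIndex([
    '2018-05-06 18:02:24', '2018-05-06 18:02:45', '2018-05-06 18:03:05',
    '2018-05-06 18:08:30', '2018-05-06 18:08:50', '2018-05-06 18:09:10',
])
df = pd.DataFrame({'col': [1.15e7, 1.35e7, 1.08e7,
                           1.11e7, 1.20e7, 1.30e7]}, index=idx)

rounded = df.index.round('20S')
print(rounded.duplicated().any())
# True: 18:08:30 and 18:08:50 both round to 18:08:40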

Any suggestions on how to adjust the drift without losing too much data?

Thanks.

Atif
  • Can you add a sample of the second `DataFrame` for the merge, with the expected output? I think a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) would help. – jezrael Mar 11 '20 at 07:22
  • @jezrael I added a sample of the required version. – Atif Mar 11 '20 at 07:28

1 Answer


DatetimeIndex.round with a '20S' frequency (20 seconds) works for me:

df.index = df.index.round('20S')
print (df)
                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:08:40  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:20  13000000.0

Another idea is to use DataFrame.resample - it creates a DatetimeIndex with consecutive values:

df = df.resample('20S').mean() 
print (df)

                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:03:20         NaN
2018-05-06 18:03:40         NaN
2018-05-06 18:04:00         NaN
2018-05-06 18:04:20         NaN
2018-05-06 18:04:40         NaN
2018-05-06 18:05:00         NaN
2018-05-06 18:05:20         NaN
2018-05-06 18:05:40         NaN
2018-05-06 18:06:00         NaN
2018-05-06 18:06:20         NaN
2018-05-06 18:06:40         NaN
2018-05-06 18:07:00         NaN
2018-05-06 18:07:20         NaN
2018-05-06 18:07:40         NaN
2018-05-06 18:08:00         NaN
2018-05-06 18:08:20  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:00  13000000.0

Then it is possible to remove these all-NaN rows:

df = df.resample('20S').mean().dropna(how='all') 
print (df)
                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:08:20  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:00  13000000.0
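
Because the resampled index is unique, a subsequent merge works without duplicate-index problems. A minimal sketch, assuming a hypothetical second frame other aligned to the same 20-second grid:

import pandas as pd

# Hypothetical second DataFrame on the same 20-second grid (names are illustrative)
other = pd.DataFrame({'other_col': [1, 2, 3]},
                     index=pd.date_range('2018-05-06 18:02:20',
                                         periods=3, freq='20S'))

# df is the resampled frame from above; a left join keeps all of its rows
merged = df.join(other, how='left')
print(merged)
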
jezrael
  • Thanks, but I already tried this, and since there is a drift, the duplication will still occur. I also added the frequencies I already tried to the question. – Atif Mar 11 '20 at 07:41
  • @AtifRaza - Can you create some sample data that produces the duplications, and then add the expected output? As close to the real data as possible. – jezrael Mar 11 '20 at 07:42
  • I added the entries causing index duplication when using `round`. Can you please also comment if using resampling with `mean` could solve this issue? Thanks. – Atif Mar 11 '20 at 12:41
  • @AtifRaza - Yes, resample with `mean` creates a unique `DatetimeIndex`, like `df = df.resample('20S').mean()` – jezrael Mar 11 '20 at 12:51
  • 1
    Great. If you add the resampling bit to the answer it would be better. Then I can accept that. Thanks again. – Atif Mar 11 '20 at 12:59