
I have a dataset with an average interval of 22.xx seconds between recordings, while the median interval is 21 seconds.

I tried the DatetimeIndex.floor/ceil/round functions (with 20-, 21-, and 22-second frequencies), but these lead to duplicated indices. Since I subsequently need to merge the dataframe with another one, duplicated indices have to be avoided, and dropping the duplicated indices leads to significant data loss.

I want the dataframe to use an equidistant time index, preferably aligned to the 00/20/40-second marks.

A few sample rows of the data are shown below.

2018-05-06 18:02:24    1.15e+07
2018-05-06 18:02:45    1.35e+07
2018-05-06 18:03:05    1.08e+07
...
2018-05-06 18:08:30    1.11e+07
2018-05-06 18:08:50    1.20e+07
2018-05-06 18:09:10    1.30e+07
...

After processing, the data should look like the following.

2018-05-06 18:02:20    1.15e+07
2018-05-06 18:02:40    1.35e+07
2018-05-06 18:03:00    1.08e+07
...
2018-05-06 18:08:20    1.11e+07
2018-05-06 18:08:40    1.20e+07
2018-05-06 18:09:00    1.30e+07
...

Rounding causes the above to have duplicated indices, and floor and ceil do as well.
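
For reference, a minimal sketch reproducing the duplication with the sample rows above (the column name col is illustrative):

import pandas as pd

idx = pd.DatetimeIndex([
    '2018-05-06 18:02:24', '2018-05-06 18:02:45', '2018-05-06 18:03:05',
    '2018-05-06 18:08:30', '2018-05-06 18:08:50', '2018-05-06 18:09:10',
])
df = pd.DataFrame({'col': [1.15e7, 1.35e7, 1.08e7,
                           1.11e7, 1.20e7, 1.30e7]}, index=idx)

rounded = df.index.round('20S')
print(rounded.duplicated().any())
# True: 18:08:30 and 18:08:50 both round to 18:08:40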

Any suggestions on how to adjust the drift without losing too much data?

Thanks.

Atif
  • Can you add a sample of the second `DataFrame` for the merge, with the expected output? I think a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) would help. – jezrael Mar 11 '20 at 07:22
  • @jezrael I added a sample of the required version. – Atif Mar 11 '20 at 07:28

1 Answer


DatetimeIndex.round with a '20S' frequency (20 seconds) works for me:

df.index = df.index.round('20S')
print (df)
                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:08:40  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:20  13000000.0

Another idea is to use DataFrame.resample - it creates a DatetimeIndex with consecutive values:

df = df.resample('20S').mean() 
print (df)

                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:03:20         NaN
2018-05-06 18:03:40         NaN
2018-05-06 18:04:00         NaN
2018-05-06 18:04:20         NaN
2018-05-06 18:04:40         NaN
2018-05-06 18:05:00         NaN
2018-05-06 18:05:20         NaN
2018-05-06 18:05:40         NaN
2018-05-06 18:06:00         NaN
2018-05-06 18:06:20         NaN
2018-05-06 18:06:40         NaN
2018-05-06 18:07:00         NaN
2018-05-06 18:07:20         NaN
2018-05-06 18:07:40         NaN
2018-05-06 18:08:00         NaN
2018-05-06 18:08:20  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:00  13000000.0

Then it is possible to remove these all-NaN rows:

df = df.resample('20S').mean().dropna(how='all') 
print (df)
                            col
2018-05-06 18:02:20  11500000.0
2018-05-06 18:02:40  13500000.0
2018-05-06 18:03:00  10800000.0
2018-05-06 18:08:20  11100000.0
2018-05-06 18:08:40  12000000.0
2018-05-06 18:09:00  13000000.0
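
Because the resampled index is unique, a subsequent merge works without duplicate-index problems. A minimal sketch, assuming a hypothetical second frame other aligned to the same 20-second grid:

import pandas as pd

# Hypothetical second DataFrame on the same 20-second grid (names are illustrative)
other = pd.DataFrame({'other_col': [1, 2, 3]},
                     index=pd.date_range('2018-05-06 18:02:20',
                                         periods=3, freq='20S'))

# df is the resampled frame from above; a left join keeps all of its rows
merged = df.join(other, how='left')
print(merged)
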
jezrael
  • Thanks, but I already tried this, and since there is a drift, the duplication will still occur. I also added the frequencies I already tried to the question. – Atif Mar 11 '20 at 07:41
  • @AtifRaza - Can you create some sample data that produces the duplications, and then add the expected output? As close to the real data as possible. – jezrael Mar 11 '20 at 07:42
  • I added the entries causing index duplication when using `round`. Can you please also comment if using resampling with `mean` could solve this issue? Thanks. – Atif Mar 11 '20 at 12:41
  • @AtifRaza - Yes, resample with `mean` creates a unique `DatetimeIndex`, like `df = df.resample('20S').mean()` – jezrael Mar 11 '20 at 12:51
  • 1
    Great. If you add the resampling bit to the answer it would be better. Then I can accept that. Thanks again. – Atif Mar 11 '20 at 12:59