
I have a dataframe (df, with time as the index and a single column 'Pt0') that I want to upsample and interpolate with the "nearest neighbor" method. I have two issues:

  1. When I compute df = df.resample('1D'), I get a pandas.core.resample.DatetimeIndexResampler object, which keeps me from recovering the values of my column (though I can still get the index), whereas I only want a dataframe as output. What I don't understand is that applying this command to other dataframes usually gives me a dataframe, not that "core" object.
  2. If I apply the upsampling and interpolation directly, df = df.resample('1D').interpolate(method='nearest'), I only obtain NaNs, while before I had both NaNs and values.

I don't understand what I am doing wrong, nor why a "core" object is created here when this same method (df.resample('1D')) gave me dataframes in other cases. How can I solve this problem?

PS: df does not have duplicates in its index, because it was computed specifically to avoid them (see Pandas drop duplicates and replace the value by the nanmean of the duplicates).

Here is the dataframe:

df
Out[174]: 
                                 Pt0
1984-06-10 00:00:00.096000064  -42.0
1984-07-20 00:00:00.176000000    NaN
1984-07-28 00:00:00.192000000  -26.0
1984-10-08 00:00:00.336000064  -12.0
1984-10-16 00:00:00.352000000   -5.0
                             ...
2021-04-05 08:48:28.559141120 -248.0
2021-04-05 08:48:29.059141120 -318.0
2021-04-19 20:36:46.060141056 -311.0
2021-05-04 03:02:44.279659008 -254.0
2021-05-29 02:55:17.930625024 -286.0

[529 rows x 1 columns]

Code to reproduce my issues:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Pt0': [np.nan, -42.0, np.nan, np.nan, -26.0, np.nan, np.nan, np.nan, 0.0, -10.0]},
                      index=['1984-06-10 00:00:00.096000064', '1984-06-10 00:00:00.096000064',
                             '1984-07-20 00:00:00.176000000', '1984-07-20 00:00:00.176000000',
                             '1984-07-28 00:00:00.192000000', '1984-07-28 00:00:00.192000000',
                             '1984-09-06 00:00:00.080000000', '1984-09-06 00:00:00.080000000',
                             '1984-09-06 00:00:00.271999936', '1984-09-06 00:00:00.271999936'])
    df.index = pd.to_datetime(df.index)
    # Collapse duplicated timestamps to the nanmean of their values
    df = df.groupby(level=0)['Pt0'].transform(np.nanmean).drop_duplicates().to_frame()
    df2 = df.resample('1D')                         # issue 1: a DatetimeIndexResampler, not a dataframe
    df3 = df.resample('1D').interpolate('nearest')  # issue 2: only NaNs
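Inspecting df2 shows the "core" object in question (resample alone returns a lazy Resampler, which is what I would like to avoid):

    >>> type(df2)
    <class 'pandas.core.resample.DatetimeIndexResampler'>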
Nihilum

1 Answer


You're doing nothing wrong, but you missed one step. You need to align your source index to your target index (days): 1984-06-10 00:00:00.096000064 is not equal to 1984-06-10, i.e. 1984-06-10 00:00:00.000000000. That's why you can see your original values in the Resampler object but not in the final result:

>>> list(df.resample('D'))[0]
(Timestamp('1984-06-10 00:00:00', freq='D'),
                                 Pt0
 1984-06-10 00:00:00.096000064 -42.0)

What you need is to apply an aggregation between resample and interpolate. In the present case, taking the first value is sufficient, but imagine you had multiple values for the same day: how could Pandas guess which value to pick? In fact, before upsampling to get extra days, you need to downsample your intraday values.

Step-1:

>>> df.resample('D').first()  # or mean() or whatever you want
             Pt0
1984-06-10 -42.0  # <- now the index and values are aligned
1984-06-11   NaN
1984-06-12   NaN
1984-06-13   NaN
1984-06-14   NaN
...          ...
1984-09-02   NaN
1984-09-03   NaN
1984-09-04   NaN
1984-09-05   NaN
1984-09-06  -5.0

[89 rows x 1 columns]

Step-2:

>>> df.resample('D').first().interpolate('time')  # or nearest or ...
                  Pt0
1984-06-10 -42.000000
1984-06-11 -41.666667
1984-06-12 -41.333333
1984-06-13 -41.000000
1984-06-14 -40.666667
...               ...
1984-09-02  -7.100000
1984-09-03  -6.575000
1984-09-04  -6.050000
1984-09-05  -5.525000
1984-09-06  -5.000000

[89 rows x 1 columns]
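The same two-step pipeline also works with the 'nearest' method asked for in the question; only the interpolation call changes (note that method='nearest' is one of the interpolation methods pandas delegates to SciPy, so SciPy must be installed):

    >>> df.resample('D').first().interpolate('nearest')  # snaps each missing day to the closest observed value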

You can do the same with groupby, which can feel more intuitive because it reads naturally: "I want to group my values per day, then compute the mean, and finally fill missing values by interpolating with method X":

>>> df.groupby(pd.Grouper(freq='D')).mean().interpolate('time')
                  Pt0
1984-06-10 -42.000000
1984-06-11 -41.666667
1984-06-12 -41.333333
1984-06-13 -41.000000
1984-06-14 -40.666667
...               ...
1984-09-02  -7.100000
1984-09-03  -6.575000
1984-09-04  -6.050000
1984-09-05  -5.525000
1984-09-06  -5.000000

[89 rows x 1 columns]
Corralien
  • That was an interesting problem. I hope my explanation is clear enough :) – Corralien Jan 29 '22 at 02:40
  • Thank you very much! It works very well and I really appreciate the detailed explanation :) I have one more question though: when using df.resample('D').first(), how come we don't end up with a NaN for 1984-06-10? Because when using df.drop_duplicates(), the "first" argument was considering the NaN as the first value and dumping the value -42.0. – Nihilum Jan 29 '22 at 18:54
  • In `drop_duplicates`, the "first" argument means the first row encountered, whatever its value. For the "first" method of `Resampler` or `Grouper`, it's like `.dropna().first()`. Here I used `first` because you have only one value per day; in real life, `mean` is better even if you have only one value. – Corralien Jan 29 '22 at 19:27
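A minimal sketch of that distinction (constructed here for illustration; it uses index.duplicated to keep the first row per timestamp, one way of emulating the drop-duplicates behaviour described above):

    import numpy as np
    import pandas as pd

    idx = pd.to_datetime(['1984-06-10 00:00:00.096000064'] * 2)
    s = pd.Series([np.nan, -42.0], index=idx)

    # Keeping the first *row* per timestamp keeps whatever is there, NaN included:
    s[~s.index.duplicated(keep='first')]
    # 1984-06-10 00:00:00.096000064   NaN

    # Resampler.first() (like Grouper's first) skips NaNs within each bin:
    s.resample('D').first()
    # 1984-06-10   -42.0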