0

I have a dataframe (index = date) that I create by appending 4 dataframes. Because of that I have duplicates in my index, with often the same day having 3 NaNs and 1 value. My goal is to upsample this dataframe to daily frequency (df = df.resample('1D)), but before that I have to drop the duplicates.

I would like to drop the time duplicates, but according to 2 conditions:

  • For the same day, if we have at least 1 value, calculate the nanmean of the values and drop the rest.
  • For the same day, if we only have NaNs, then put a NaN in the row we keep.

I guessed that using np.nanmean() would cover these two conditions (returns a NaN when there are no values, and the mean of the values otherwise).

For example:

df = pd.DataFrame({'Pt0': [nan, -42.0, nan, nan, -26.0, nan, nan, nan, 0.0, -10.0]}, 
             index=['1984-06-10 00:00:00.096000064', '1984-06-10 00:00:00.096000064',
                    '1984-07-20 00:00:00.176000000', '1984-07-20 00:00:00.176000000',
                    '1984-07-28 00:00:00.192000000', '1984-07-28 00:00:00.192000000',
                    '1984-09-06 00:00:00.080000000', '1984-09-06 00:00:00.080000000',
                    '1984-09-06 00:00:00.271999936', '1984-09-06 00:00:00.271999936'])
df = 
                                Pt0
1984-06-10 00:00:00.096000064   NaN
1984-06-10 00:00:00.096000064 -42.0
1984-07-20 00:00:00.176000000   NaN
1984-07-20 00:00:00.176000000   NaN
1984-07-28 00:00:00.192000000 -26.0
1984-07-28 00:00:00.192000000   NaN
1984-09-06 00:00:00.080000000   NaN
1984-09-06 00:00:00.080000000   NaN
1984-09-06 00:00:00.271999936   0
1984-09-06 00:00:00.271999936   -10

df_dropped = 
                               Pt0
1984-06-10 00:00:00.096000064 -42.0
1984-07-20 00:00:00.176000000   NaN
1984-07-28 00:00:00.192000000 -26.0
1984-09-06 00:00:00.080000000 -5.0

I tried with df = df.groupby('Pt0').mean().reset_index() but it ends up skipping the NaNs, I guess it would work if df.groupby() had a nanmean() function.

How could I do that ?

Nihilum
  • 549
  • 3
  • 11

1 Answers1

1

First, convert the index to datetime objects. Then you can groupby the index and transform np.nanmean; then drop_duplicates:

df.index = pd.to_datetime(df.index)
out = df.groupby(level=0)['Pt0'].transform(np.nanmean).drop_duplicates().to_frame()

Output:

                                  Pt0
0 1984-06-10 00:00:00.096000064 -42.0
1 1984-07-20 00:00:00.176000000   NaN
2 1984-07-28 00:00:00.192000000 -26.0
3 1984-09-06 00:00:00.271999936  -5.0