2

I am trying to resample a timeseries data from 5 min frequency to hourly average.

 df = pd.read_csv("my_data.csv", index_col=False, usecols=['A','B','C'])

output:

         A            B      C
0   16-01-21 0:00   95.75   0.0
1   16-01-21 0:05   90.10   0.0
2   16-01-21 0:10   86.26   0.0
3   16-01-21 0:15   92.72   0.0
4   16-01-21 0:20   81.54   0.0

df.A= pd.to_datetime(df.A)

Output:

             A            B      C
0   2021-01-16 00:00:00 95.75   0.0
1   2021-01-16 00:05:00 90.10   0.0
2   2021-01-16 00:10:00 86.26   0.0
3   2021-01-16 00:15:00 92.72   0.0
4   2021-01-16 00:20:00 81.54   0.0

Now I set the Timestamp column as index,

df.set_index('A', inplace=True)

And when I try to resample with

df2 = df.resample('H').mean()

I am getting this,

                       B         C
        A       
2021-01-02 00:00:00 79.970278   0.0
2021-01-02 01:00:00 77.951667   0.0
2021-01-02 02:00:00 77.610556   0.0
2021-01-02 03:00:00 80.800000   0.0
2021-01-02 04:00:00 84.305000   0.0

Was expecting this kind of timestamp with the average values for each hour,

       A             B         C
2021-01-16 00:00:00 79.970278   0.0
2021-01-16 01:00:00 77.951667   0.0
2021-01-16 02:00:00 77.610556   0.0
2021-01-16 03:00:00 80.800000   0.0
2021-01-16 04:00:00 84.305000   0.0

I am not sure where I am making a mistake. Help me out.

  • hej @Aswin, i don't see where your problem is. When I run your code for the dataframe you have given (5 rows), it correctly outputs one average = one row as all of the timestamps fall into one hour. – Per Joachims Oct 04 '21 at 06:51
  • Hi @Per Joachim, thanks for your response. The output I am getting shows the date as 2nd January. But i actually have data from 16th January to 16th September. I am expecting to get 24 rows for one day. With the average values. for the expected values I put up there, i just changed the timestamp to reflect the dates from my original data set. – Aswin Kumar Oct 04 '21 at 07:00
  • Hi @Corralien, The difference is the timestamps, I was expecting to get hourly averages for the 16th of January. But got 2nd of January. – Aswin Kumar Oct 04 '21 at 07:02
  • hej @Aswin, when i run your code I see results for the correct timestamp. So maybe you should add the whole dataset so that we can reproduce the error – Per Joachims Oct 04 '21 at 07:06
  • I think you need change `df.A= pd.to_datetime(df.A)` to `df.A= pd.to_datetime(df.A, dayfirst=True)` – jezrael Oct 04 '21 at 07:09
  • @MrFuppes - you are right. – jezrael Oct 04 '21 at 07:24

1 Answers1

2

I think problem here is some datetimes are wrongly converted:

#default is month first in df.A= pd.to_datetime(df.A)
01-02-21 -> 2021-01-02

Possible solutions:

df.A= pd.to_datetime(df.A, dayfirst=True)

Or:

df = pd.read_csv("my_data.csv", 
                 index_col=False, 
                 usecols=['A','B','C'], 
                 parse_dates=['A'], 
                 dayfirst=True)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252