I am trying to resample my DataFrame, but when I group with pd.Grouper, many values seem to be left out of the result. I have hierarchical data from 2003 to 2020 that bottoms out in time series data like this:

polar_temp
         Station_Number      Date  Value
417         CA002100805  20030101   -296
423         CA002202570  20030101   -269
425         CA002203058  20030101   -268
427         CA002300551  20030101    -23
428         CA002300902  20030101   -200

I set a multi index on Station_Number and Date:

import pandas as pd

# Parse the YYYYMMDD values into datetimes, then index by station and date
polar_temp['Date'] = pd.to_datetime(polar_temp['Date'], format='%Y%m%d')
polar_temp = polar_temp.set_index(['Station_Number', 'Date'])

                           Value
Station_Number Date             
CA002100805    2003-01-01   -296
CA002202570    2003-01-01   -269
CA002203058    2003-01-01   -268
CA002300551    2003-01-01    -23
CA002300902    2003-01-01   -200

Now I would like to resample the data by taking the mean of Value over consecutive 8-day windows, using:

polar_temp8d = polar_temp.groupby([pd.Grouper(level='Station_Number'),
                                    pd.Grouper(level='Date', freq='8D')]).mean()

                                Value
Station_Number Date                  
CA002100805    2003-01-01 -300.285714
               2003-01-09 -328.750000
               2003-01-17 -325.500000
               2003-01-25 -385.833333
               2003-02-02 -194.428571
...                               ...
USW00027515    2005-06-23   76.625000
               2005-07-01   42.375000
               2005-07-09   94.500000
               2005-07-17   66.500000
               2005-07-25   56.285714
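For reference, the same grouping can be reproduced on a small synthetic frame. This is only a sketch: the station names `STATION_A` and `STATION_B`, the values, and the dropped day are made up for illustration and are not taken from the real data. Adding a `count` column next to the mean shows how many daily readings went into each 8-day bin.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for polar_temp: two hypothetical stations, 32 daily readings each.
dates = pd.date_range("2003-01-01", periods=32, freq="D")
index = pd.MultiIndex.from_product(
    [["STATION_A", "STATION_B"], dates], names=["Station_Number", "Date"]
)
demo = pd.DataFrame({"Value": np.arange(len(index), dtype=float)}, index=index)

# Drop one day for one station to imitate a missing reading.
demo = demo.drop(index=[("STATION_A", pd.Timestamp("2003-01-03"))])

# Same grouping as in the question, but reporting the mean and the
# number of readings that fell into each 8-day bin.
summary = demo.groupby(
    [pd.Grouper(level="Station_Number"), pd.Grouper(level="Date", freq="8D")]
)["Value"].agg(["mean", "count"])

print(len(demo), "input rows ->", len(summary), "output rows")
print(summary)
```

In this sketch every input row lands in exactly one bin, so the `count` column sums to the number of input rows even though the output has far fewer rows. The fractional parts of the means shown above (for example -300.285714) are also consistent with some bins holding fewer than 8 readings.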

The problem is that only about 60,000 rows are returned, although the input df has about 1 million rows. I tried the same procedure on only the years 2003 to 2011 and again got about 60,000 rows. Hence my questions:

  • Did I use pd.Grouper incorrectly?
  • Is the problem perhaps due to leap years?
  • Or is there another way to resample the data?
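On the last point, one alternative I am aware of is to move the station level back into a column and call `resample` per group. The sketch below uses made-up data (the station names and values are hypothetical), and I have not checked whether it bins the real data the same way as the Grouper version, since the bin edges may be anchored differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for polar_temp: two hypothetical stations, 32 daily readings each.
dates = pd.date_range("2003-01-01", periods=32, freq="D")
index = pd.MultiIndex.from_product(
    [["STATION_A", "STATION_B"], dates], names=["Station_Number", "Date"]
)
demo = pd.DataFrame({"Value": np.arange(len(index), dtype=float)}, index=index)

# Leave only Date in the index, group by station, then resample each group.
alt = (
    demo.reset_index("Station_Number")
    .groupby("Station_Number")["Value"]
    .resample("8D")
    .mean()
)

print(alt)
```

The result is a Series indexed by (Station_Number, Date), with one row per station and 8-day window.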
tillwss
  • How about counting the rows for each station number? Are the counts all the same? What is the total number of days between 2003 and 2021? One eighth of the number of days times the number of stations gives the expected total number of rows. – r-beginners Jun 05 '21 at 04:52
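The estimate suggested in this comment can be worked through as a quick calculation. This is a rough sketch that assumes the data covers the full calendar years 2003-01-01 through 2020-12-31 (the question only says "2003 to 2020"); `expected_rows` and its `n_stations` argument are hypothetical helpers, since the number of stations is not given.

```python
import math
import pandas as pd

# Assumption: the data covers 2003-01-01 through 2020-12-31 inclusive.
start, end = pd.Timestamp("2003-01-01"), pd.Timestamp("2020-12-31")
days = (end - start).days + 1

# Each station can contribute at most one output row per 8-day bin.
bins_per_station = math.ceil(days / 8)


def expected_rows(n_stations: int) -> int:
    """Upper bound on output rows if every station reports in every 8-day bin."""
    return n_stations * bins_per_station


print(days, bins_per_station)  # 6575 822
```

On the real frame, the per-station row counts could be inspected with something like `polar_temp.groupby(level="Station_Number").size()`, and the number of distinct stations could then be passed to `expected_rows` to compare against the roughly 60,000 rows returned.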

0 Answers