
I have just discovered the power of Pandas and I love it, but I can't figure out this problem:

I have a DataFrame; df.head() shows:

         lon        lat         h  filename                       time
0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467

I would like to group my data into nine-day intervals:

gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

The first group:

2002-05-15 12:59:31.717467
         lon        lat         h  filename                       time
0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
...

Next group:

2002-05-24 12:59:31.717467
           lon        lat         h  filename                       time
815  18.309498  80.457024  0.187387     60309 2002-05-24 16:35:39.553563
816  18.291458  80.458514  0.061446     60309 2002-05-24 16:35:39.610563
817  18.273408  80.460014  0.129255     60309 2002-05-24 16:35:39.667563
818  18.255358  80.461504  0.046761     60309 2002-05-24 16:35:39.724563
...

So the data are grouped into nine-day intervals counted from the first timestamp (12:59:31.717467), and not from the beginning of the day as I would like.

When grouping by one day:

gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

2002-05-15 00:00:00
         lon        lat         h  filename                       time
0  19.961216  80.617627 -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
...

I could just loop over the days until I reach a nine-day interval, but I think it could be done more elegantly. I am looking for a Grouper freq option equivalent to YS (start of year) but for days, a way of setting the start time (maybe via the Grouper option convention : {'start', 'end', 'e', 's'}), or something else?

I am running Python 3.5.2 and pandas version 0.19.0.
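For reference, the manual alternative I have in mind looks roughly like this (just a sketch, assuming df['time'] is a datetime64 column): compute each row's whole-day offset from the first day's midnight and floor-divide by nine to get a bin label.

import pandas as pd

# Sketch of the manual approach: bin rows into 9-day buckets anchored at
# midnight of the first day, without modifying the 'time' column itself.
start = df['time'].min().normalize()              # midnight of the first day
offset_days = (df['time'] - start).dt.days        # whole days since that midnight
labels = start + pd.to_timedelta((offset_days // 9) * 9, unit='D')
gb = df.groupby(labels)                           # group labels fall on midnights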

user1643523
  • [This answer](http://stackoverflow.com/a/22528074/754456) suggests adding the parameter `closed='left'` might do it? – mfitzp Nov 11 '16 at 14:38
  • I have tried it, but it does not change anything – user1643523 Nov 11 '16 at 14:43
  • Does `convention='s'` do anything? The documentation is seriously lacking on what the parameters to `TimeGrouper` do. – mfitzp Nov 11 '16 at 14:44
  • If you can't get it to work correctly, another option would be to convert the column to dates (not datetimes) which will drop the time component (rounding down to midnight on that day). – mfitzp Nov 11 '16 at 14:46
  • Yes, I have not been able to find an example of using `convention`. I have just tried using `convention='s'`, but without any luck. – user1643523 Nov 11 '16 at 14:50
  • Yes, I could maybe change the date format, but I need the time component, so I would need to make a new column. Thanks, it may be the best option? – user1643523 Nov 11 '16 at 14:54
  • `convention` is for `PeriodIndex`; it would not apply in your case. – Steven G Nov 11 '16 at 15:09

4 Answers


Dropping first time row:

Your best bet would be to normalize the first row of the datetime column so that its time is reset to 00:00:00 (midnight), and then group according to the 9D interval:

df.loc[0, 'time'] = df['time'].iloc[0].normalize()   # reset the first timestamp to midnight
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print(grp)

#          lon        lat         h  filename                       time
# 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
# 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
# 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
# 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
# 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
# ......................................................................

The time in all the other rows is left untouched, so you don't lose that information; only the very first timestamp is changed.
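If the original first timestamp matters for later processing, a small variation (names here are illustrative) is to keep a copy before normalizing and put it back after the grouped computation:

first_time = df['time'].iloc[0]              # keep a copy of the original timestamp
df.loc[0, 'time'] = first_time.normalize()   # reset it to midnight for the grouping
result = df.groupby(pd.TimeGrouper(key='time', freq='9D')).size()  # or any other aggregation
df.loc[0, 'time'] = first_time               # put the original value back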


Keeping first time row:

If you want to keep the first time row as it is, without making any changes to it, but still start the grouping from midnight onwards, you could do:

df_t_shift = df.shift()    # shift everything down by one row
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()   # midnight of the first day
# Append the last row of df, which was dropped off the end by the shift
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)

for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
# 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
# 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
# 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
# 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
# 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467
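When iterating over the shifted groups, the synthetic midnight row can be filtered out again, for example by dropping rows whose data columns are all NaN (a sketch):

for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    grp = grp.dropna(subset=['lon', 'lat', 'h', 'filename'], how='all')  # drop the placeholder row
    print(grp)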
Nickil Maveli

If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to dates, e.g.

df['date'] = df['time'].apply(lambda x: x.date())

However, you cannot use TimeGrouper on plain dates; the grouping key has to hold datetime values. You instead have two options: either truncate the datetimes to midnight directly, as follows:

df['date'] = df['time'].apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))

Alternatively, you can first generate date values, then convert them back to datetimes, using the pd.to_datetime() function:

df['date'] = df['time'].apply(lambda x: x.date())
df['date'] = pd.to_datetime(df['date'])
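A more concise equivalent, assuming df['time'] is already a datetime64 column, is the .dt.normalize() accessor, which truncates each timestamp to midnight while keeping the datetime dtype:

df['date'] = df['time'].dt.normalize()   # midnight of each row's day, still datetime64
gb = df.groupby(pd.TimeGrouper(key='date', freq='9D'))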
mfitzp

Completing @mfitzp's answer, you could do this:

df['dateonly'] = df['time'].apply(lambda x: x.date())

The only problem with that is that df['dateonly'] would not have a datetime dtype, so you need to convert it first:

df['dateonly'] = pd.to_datetime(df['dateonly'])

Now you can group on it:

gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))
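As a quick sanity check, the group labels should now fall on midnights nine days apart:

for name, grp in gb:
    print(name, len(grp))   # e.g. 2002-05-15 00:00:00 ..., then 2002-05-24 00:00:00 ...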

For extra information: convention is used with PeriodIndex, not DatetimeIndex.

Steven G

I know this is a late entry, but according to the pandas.Grouper documentation, you can set the origin option of pandas.Grouper to "start_day", as in:

for _, grp in df.groupby(pd.Grouper(key="time", freq="9D", origin="start_day")):
    print(grp)

This option was added in pandas 1.1.0.
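For a self-contained sketch with synthetic data (recent pandas, where TimeGrouper has been removed in favour of pd.Grouper):

import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2002-05-15 12:59:31', '2002-05-20 08:00:00',
                            '2002-05-24 16:35:39', '2002-05-30 01:23:45']),
    'h': [-0.08, 0.05, 0.19, 0.30],
})

# Bins are anchored at midnight of the first day: 2002-05-15, 2002-05-24, ...
for name, grp in df.groupby(pd.Grouper(key='time', freq='9D', origin='start_day')):
    print(name, len(grp))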

schardong