Grouping Period series values in Pandas

Question

Following on from "Reading CSV file in Pandas with historical dates", I have some CSV data in the form:

Object,Earliest Date
Object1,01/01/2000
Object2,01/01/1760
Object3,01/01/1520
...

which I've now read into Pandas (using Period to handle historical dates) and used to create a series. I'm trying to bin the series into decades, but I'm stumbling on getting the Period values into the form groupby expects. So far I've tried the following (where s is the series created with from_csv):

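import pandas as pd

def dt_parse(s):
    # Parse a dd/mm/yyyy string into a daily Period; fall back to NaT if malformed
    try:
        d, m, y = s.split('/')
        return pd.Period(year=int(y), month=int(m), day=int(d), freq='D')
    except (AttributeError, ValueError):
        return pd.NaT
s2 = s['Earliest Date'].apply(dt_parse)  # Create Period values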
pi = pd.PeriodIndex(s2)
decades = pi.groupby(pd.Grouper(freq="120M")).count()

which fails with:

 TypeError: Argument 'labels' has incorrect type (expected numpy.ndarray, got TimeGrouper)

Trying to group on it as a series:

 decades = s2.groupby(pd.Grouper(freq="120M")).count()

fails with:

 TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'

Trying to group on it as a DataFrame:

df = pd.DataFrame(s2)
decades = df.groupby(pd.Grouper(freq="120M", key='Earliest Date')).size()

fails with:

AttributeError: 'Index' object has no attribute 'to_timestamp'

I'm not sure how else to do it.

What does dt_parse do? I suspect passing `as_index=False` to df.groupby() will do what you want. — Thtu, May 03 '16 at 00:12
dt_parse returns the date as a Period (see first link). I added as_index=False to the df.groupby call (without the key argument, as it doesn't seem to apply) and now get the same error as with a series (TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index') — Richard, May 03 '16 at 07:24

score 0 · Accepted Answer · answered May 03 '16 at 07:37

The error messages and pandas documentation will be your friends here.

I have no idea whether your date column contains strictly unique dates. If it does, it's trivial: just use that column as the index and you can use pd.Grouper. Otherwise, define your own grouping function:

# I'm assuming that df is the dataframe from pd.read_csv("/path/to/csv")
# and that there's a column named "Earliest Date"
# that is a Period or Datetime or something with a year attribute

def grouper(ind):
    # groupby calls this once per index label; look up that row's year
    y = df.loc[ind, 'Earliest Date'].year
    # Round down to the start of the decade, e.g. 1765 -> 1760
    return y - (y % 10)

gb = df.groupby(by=grouper)
print(gb.size())
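
For reference, here is a minimal self-contained sketch of the same decade-binning idea that computes the decade for every row up front instead of looking rows up one at a time. The sample rows (including the extra Object4) and the names periods, years, decades and counts are made up for illustration; dt_parse mirrors the function from the question. Treat it as one possible approach, not the only way to do this.

```python
import pandas as pd

# Hypothetical sample data in the same layout as the CSV in the question.
df = pd.DataFrame({
    "Object": ["Object1", "Object2", "Object3", "Object4"],
    "Earliest Date": ["01/01/2000", "01/01/1760", "01/01/1520", "15/06/1765"],
})

def dt_parse(text):
    """Parse a dd/mm/yyyy string into a daily Period, or NaT if malformed."""
    try:
        d, m, y = text.split("/")
        return pd.Period(year=int(y), month=int(m), day=int(d), freq="D")
    except (AttributeError, ValueError):
        return pd.NaT

# Build the Period objects, then pull out the year of each one.
# NaT.year is NaN, so unparseable rows end up with a NaN year.
periods = [dt_parse(value) for value in df["Earliest Date"]]
years = pd.Series([p.year for p in periods], index=df.index)

# Floor each year to the start of its decade, e.g. 1765 -> 1760.
decades = (years // 10 * 10).rename("decade")

# Group the rows by decade; groupby drops NaN keys by default.
counts = df.groupby(decades).size()
print(counts)
```

Grouping on a separate series of decade labels avoids pd.Grouper entirely, so nothing ever needs to be converted to a Timestamp, which matters for dates as early as 1520.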

Wonderful, working now thanks. Quite a deep dive for a first project, definitely back to the docs! — Richard, May 03 '16 at 20:33
