Difference between pandas aggregators .first() and .last()

Question

I'm curious what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but my understanding is that if you pass an argument to first or last, e.g. 3, it returns the first 3 months or the first 3 years.

In this case, since I'm not passing any arguments to first() and last(), what are they actually doing when I resample like that? I know that if I chain .mean() to the resample, I get one row per year holding the mean of that year's values, but what happens when I use last()?

More importantly, why do first() and last() give me different answers in this context? I can see that the results are not numerically equal.

i.e.: post2008.resample('A').first() != post2008.resample('A').last()

TLDR:

What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?

This is the code before the aggregation:

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]

# Print the last 8 rows of post2008
print(post2008.tail(8))

This is what print(post2008.tail(8)) outputs:

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5

Here is the code that resamples and aggregates by last():

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)

This is what yearly looks like for post2008.resample('A').last():

              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5

Here is the code that resamples and aggregates by first():

# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)

This is what yearly looks like for post2008.resample('A').first():

              VALUE
DATE               
2008-12-31  14668.4
2009-12-31  14383.9
2010-12-31  14681.1
2011-12-31  15238.4
2012-12-31  15973.9
2013-12-31  16475.4
2014-12-31  17025.2
2015-12-31  17783.6
2016-12-31  18281.6

Helder · Answer 1 · 2022-07-25T11:09:36.640

Before anything else, let's create a dataframe with example data:

import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                            '2015-04-01', '2015-07-01', '2015-10-01',
                            '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)

The output will be:

            VALUE
2014-07-01   1000
2014-10-01   2000
2015-01-01   3000
2015-04-01   4000
2015-07-01   5000
2015-10-01   6000
2016-01-01   7000
2016-04-01   8000

If we pass e.g. '6M' to df.first (which is a DataFrame method, not an aggregator), we select the first six months of data, which in the example above consists of just two rows:

print(df.first('6M'))

            VALUE
2014-07-01   1000
2014-10-01   2000

Similarly, last returns only the rows that belong to the last six months of data:

print(df.last('6M'))

            VALUE
2016-01-01   7000
2016-04-01   8000

In this context, not passing the required argument results in an error:

print(df.first())

TypeError: first() missing 1 required positional argument: 'offset'
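
As a side note, as far as I know the offset-based DataFrame.first and DataFrame.last are deprecated from pandas 2.1 onwards, and the documentation suggests filtering with .loc instead. Below is a minimal sketch of that approach, assuming a quarterly index similar to the example data. The names cutoff, start, first_six_months and last_six_months are my own, and pd.DateOffset(months=6) counts calendar months, which is not exactly how '6M' is anchored, so treat it as an approximation rather than a drop-in replacement:

```python
import pandas as pd

# Quarterly example data, similar to the frame used above
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)},
                  index=pd.date_range('2014-07-01', periods=8, freq='QS'))

# Rows strictly before "first timestamp + 6 calendar months"
cutoff = df.index[0] + pd.DateOffset(months=6)
first_six_months = df.loc[df.index < cutoff]

# Rows strictly after "last timestamp - 6 calendar months"
start = df.index[-1] - pd.DateOffset(months=6)
last_six_months = df.loc[df.index > start]

print(first_six_months)
print(last_six_months)
```

For this data, both selections contain two rows, the same ones picked by first('6M') and last('6M') above.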

On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):

print(df.resample('Y').first())

            VALUE
2014-12-31   1000
2015-12-31   3000  # This is the first of the 4 values from 2015
2016-12-31   7000

print(df.resample('Y').last())

            VALUE
2014-12-31   2000
2015-12-31   6000  # This is the last of the 4 values from 2015
2016-12-31   8000
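
To see why the two results differ, it can help to put several aggregations side by side. The sketch below is only an illustration under my own assumptions: it rebuilds similar quarterly data, and it passes a pd.offsets.YearEnd() object instead of a frequency string, because the spelling of the yearly alias ('A', 'Y', 'YE') has changed between pandas versions. The names year_end and summary are mine:

```python
import pandas as pd

# Quarterly example data, similar to the frame used above
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)},
                  index=pd.date_range('2014-07-01', periods=8, freq='QS'))

# One bin per calendar year, labelled by the year's last day
year_end = pd.offsets.YearEnd()

# Each aggregation reduces the rows of a year to a single number:
# 'first' and 'last' pick one existing value, 'mean' combines all of them
summary = df['VALUE'].resample(year_end).agg(['first', 'last', 'mean'])
print(summary)
```

Whenever a year contains more than one distinct value, the first and last columns differ, which is exactly the situation in the question.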

As an extra example, consider also the case of grouping by a smaller period:

print(df.resample('M').last().head())

             VALUE
2014-07-31  1000.0  # This is the last (and only) value from July, 2014
2014-08-31     NaN  # No data for August, 2014
2014-09-30     NaN  # No data for September, 2014
2014-10-31  2000.0
2014-11-30     NaN  # No data for November, 2014

In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
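
One subtlety related to NaNs: by default, the first and last aggregators return the first (respectively, last) non-null value in each period, not necessarily the value in the first or last row. The following sketch illustrates this with made-up data; toy, by_year and first_row are hypothetical names, and the groupby with a lambda is just one possible way to take the literal first row:

```python
import numpy as np
import pandas as pd

# Made-up data: the first and last quarters of 2015 are missing
idx = pd.to_datetime(['2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01'])
toy = pd.DataFrame({'VALUE': [np.nan, 4000.0, 5000.0, np.nan]}, index=idx)

by_year = toy.resample(pd.offsets.YearEnd())

# NaNs are skipped, so these return 4000.0 and 5000.0 for 2015
print(by_year.first())
print(by_year.last())

# Taking the literal first row of each year keeps the NaN
first_row = toy['VALUE'].groupby(toy.index.year).agg(lambda s: s.iloc[0])
print(first_row)
```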

thank you for this example and explanation! some minor correction of the answer above: for the `resample('Y').first -> last value should be 7000`, for the `resample('Y').last -> last value should be 8000` — ordem, Jul 25 '22 at 02:45
