
I have a dataframe like below:

data = [
    [123456, "2017", 150.235],
    [123456, "2017", 160],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2018", 202.5],
    [123456, "2019", 168.526],
    [123456, "2020", 175.559],
    [123456, "2020", 176],
    [123456, "2021", 206.667],
    [789101, "2017", 228.9],
    [789101, "2018", 208],
    [789101, "2018", 208],
    [789101, "2018", 208],
]

df = pd.DataFrame(
    data,
    columns=[
        "ID",
        "year",
        "value",
    ],
)
df

In this dataframe I have an ID column and two or more years per ID. Each year can have one or more value rows.

I would like to filter this dataframe so that it keeps all of the earliest-year rows per ID (even if there are duplicate values) and all of the latest-year rows (again, even if there are duplicate values I want them).

My desired output is:

       ID  year    value
0  123456  2017  150.235
1  123456  2017  160.000
2  123456  2017  135.000
3  123456  2017  135.000
4  123456  2017  135.000
5  123456  2021  206.667
6  789101  2017  228.900
7  789101  2018  208.000
8  789101  2018  208.000
9  789101  2018  208.000

I found another SO question that was similar and tried:

g = df.groupby("ID")

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

but it only keeps the first row of the first year (and the last row of the last year), and I want all of the rows.

Can anyone please advise?!

Thank you!!

user

6 Answers


You can figure out the "first" and "last" years per group, i.e., per ID, broadcast those with transform, and compare them against the entire year column to build the filter:

# Assign the grouper to a variable because we transform twice
>>> g = df.groupby("ID")["year"]

# These are the first & last years per group;
# with `transform` they are "broadcast" (repeated) to match the column length
>>> firsts = g.transform("first")
>>> lasts  = g.transform("last")

# keep rows whose year equals their group's first OR last year
>>> df.loc[df["year"].eq(firsts) | df["year"].eq(lasts)]

to get

        ID  year    value
0   123456  2017  150.235
1   123456  2017  160.000
2   123456  2017  135.000
3   123456  2017  135.000
4   123456  2017  135.000
9   123456  2021  206.667
10  789101  2017  228.900
11  789101  2018  208.000
12  789101  2018  208.000
13  789101  2018  208.000
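
Note that "first"/"last" are positional: they pick the first and last rows in each group's current order, and the sample data happens to be sorted by year within each ID. If that ordering isn't guaranteed, a sketch of the same broadcast-and-compare pattern with "min"/"max", which don't depend on row order:

g = df.groupby("ID")["year"]

# min/max of year per group instead of positional first/last,
# broadcast with transform exactly as above
firsts = g.transform("min")
lasts = g.transform("max")
out = df.loc[df["year"].eq(firsts) | df["year"].eq(lasts)]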
Mustafa Aydın

Try:

out = df.groupby("ID", group_keys=False).apply(
    lambda x: x[(x.year == x.year.min()) | (x.year == x.year.max())]
)
print(out)

Prints:

        ID  year    value
0   123456  2017  150.235
1   123456  2017  160.000
2   123456  2017  135.000
3   123456  2017  135.000
4   123456  2017  135.000
9   123456  2021  206.667
10  789101  2017  228.900
11  789101  2018  208.000
12  789101  2018  208.000
13  789101  2018  208.000
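
One caveat, here and in the other answers that compare years: year holds strings, so min/max compare lexicographically. That happens to work for fixed-width four-digit years, but if the column may mix widths, casting to integers first is safer; a sketch on a copy so the original frame stays string-typed:

# Same groupby-apply filter, but with numeric year comparisons
df2 = df.assign(year=df["year"].astype(int))
out = df2.groupby("ID", group_keys=False).apply(
    lambda x: x[(x.year == x.year.min()) | (x.year == x.year.max())]
)
print(out)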
Andrej Kesely

Here's what you need:

import pandas as pd

# Your data
data = [
    [123456, "2017", 150.235],
    [123456, "2017", 160],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2018", 202.5],
    [123456, "2019", 168.526],
    [123456, "2020", 175.559],
    [123456, "2020", 176],
    [123456, "2021", 206.667],
    [789101, "2017", 228.9],
    [789101, "2018", 208],
    [789101, "2018", 208],
    [789101, "2018", 208],
]

# Create a DataFrame
df = pd.DataFrame(
    data,
    columns=[
        "ID",
        "year",
        "value",
    ],
)
# Find the earliest and latest years for each "ID"
earliest_years = df.groupby("ID")["year"].min()
latest_years = df.groupby("ID")["year"].max()

# Filter the DataFrame to include earliest and latest year rows
filtered_df = df[
    df.apply(
        lambda row: row["year"] in [earliest_years[row["ID"]], latest_years[row["ID"]]],
        axis=1,
    )
].reset_index(drop=True)

print(filtered_df)

Output:

       ID  year    value
0  123456  2017  150.235
1  123456  2017  160.000
2  123456  2017  135.000
3  123456  2017  135.000
4  123456  2017  135.000
5  123456  2021  206.667
6  789101  2017  228.900
7  789101  2018  208.000
8  789101  2018  208.000
9  789101  2018  208.000
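
The row-wise apply works but gets slow on large frames; a vectorized sketch of the same idea, reusing the earliest_years/latest_years Series built above:

# Map each row's ID to its group's earliest/latest year, then
# compare column-wise instead of row by row
mask = df["year"].eq(df["ID"].map(earliest_years)) | df["year"].eq(df["ID"].map(latest_years))
filtered_df = df[mask].reset_index(drop=True)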
Musabbir Arrafi
import pandas as pd

data = [
    [123456, "2017", 150.235],
    [123456, "2017", 160],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2018", 202.5],
    [123456, "2019", 168.526],
    [123456, "2020", 175.559],
    [123456, "2020", 176],
    [123456, "2021", 206.667],
    [789101, "2017", 228.9],
    [789101, "2018", 208],
    [789101, "2018", 208],
    [789101, "2018", 208],
]

df = pd.DataFrame(
    data,
    columns=[
        "ID",
        "year",
        "value",
    ],
)

# Convert 'year' to datetime so it sorts chronologically
df['year'] = pd.to_datetime(df['year'])

# Sort the DataFrame by 'ID', 'year', and 'value'
df_sorted = df.sort_values(by=['ID', 'year', 'value'])

# Get the first and last row within each group
first_rows = df_sorted.groupby('ID').head(1)
last_rows = df_sorted.groupby('ID').tail(1)

# Concatenate the first and last rows
result = pd.concat([first_rows, last_rows])

print(result)
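
Note that head(1)/tail(1) return a single row per ID, so this keeps only one row from each boundary year and drops the duplicates the question wants. A sketch of the same sorted-frame approach that keeps every row tied with the earliest/latest year:

# Keep all rows matching each group's earliest/latest year, not just one
g = df_sorted.groupby('ID')['year']
result = df_sorted[
    df_sorted['year'].eq(g.transform('min')) | df_sorted['year'].eq(g.transform('max'))
]
print(result)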
Emin
0

Using a merge to include an arbitrary number of groupby.agg conditions:

out = df.merge(
    df.groupby('ID')['year'].agg(['min', 'max'])
      .stack().droplevel(1)
      .reset_index(name='year')
      .drop_duplicates()  # optional; avoids duplicate merge keys when an ID has a single year (min == max)
)

Output:

       ID  year    value
0  123456  2017  150.235
1  123456  2017  160.000
2  123456  2017  135.000
3  123456  2017  135.000
4  123456  2017  135.000
5  123456  2021  206.667
6  789101  2017  228.900
7  789101  2018  208.000
8  789101  2018  208.000
9  789101  2018  208.000
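
For reference, the intermediate frame being merged is just the per-ID boundary years; printing it (a quick check against the same df) makes the merge easier to follow:

keys = (df.groupby('ID')['year'].agg(['min', 'max'])
          .stack().droplevel(1)
          .reset_index(name='year')
          .drop_duplicates())
print(keys)
#        ID  year
# 0  123456  2017
# 1  123456  2021
# 2  789101  2017
# 3  789101  2018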
mozway

Here is an option using nsmallest() and nlargest():

# nsmallest/nlargest require a numeric column, so compare the years as integers
yr = df['year'].astype(int)
g = yr.groupby(df['ID'])

df.loc[g.nsmallest(1, keep='all').droplevel(0).index.union(g.nlargest(1, keep='all').droplevel(0).index)]

or using rank():

g = df.groupby('ID')['year']

df.loc[g.rank(method='dense').eq(1) | g.rank(method='dense', ascending=False).eq(1)]

Output:

        ID  year    value
0   123456  2017  150.235
1   123456  2017  160.000
2   123456  2017  135.000
3   123456  2017  135.000
4   123456  2017  135.000
9   123456  2021  206.667
10  789101  2017  228.900
11  789101  2018  208.000
12  789101  2018  208.000
13  789101  2018  208.000
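
The keep='all' argument is what satisfies the duplicates requirement: it retains every row tied for the smallest/largest year instead of a single one. A quick sketch to see which row labels the index union picks up:

# keep='all' returns every tied row, so the union covers all duplicates
# of the earliest and latest year per ID
yr = df['year'].astype(int)
g = yr.groupby(df['ID'])
idx = g.nsmallest(1, keep='all').droplevel(0).index.union(
    g.nlargest(1, keep='all').droplevel(0).index)
print(sorted(idx))  # [0, 1, 2, 3, 4, 9, 10, 11, 12, 13]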
rhug123