I have a df with a many-leveled MultiIndex. Early on I need to mark certain rows to keep; in subsequent sorting and processing these rows will always be kept.
I have working code, but it's not very attractive and I'm wondering if there's a prettier / more efficient way to do it.
Given a df with a 3+ level MultiIndex and an arbitrary number of columns, I run this code to check for duplicates in the first 2 levels of the MultiIndex, and mark the first occurrence as the keeper:
df['keeper'] = df.index.isin(df.assign(check=df.index.get_level_values(0), check2=df.index.get_level_values(1)).drop_duplicates(subset=['check', 'check2']).index)
Here's a toy df with the resulting keeper col:
                       0  keeper
lev0 lev1 lev2
1    1    1     0.696469    True
          2          NaN   False
     2    3     0.719469    True
          2     0.980764   False
     3    1          NaN    True
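To make this reproducible, here is a minimal sketch that rebuilds the toy df and applies the same one-liner, only split across lines for readability. The index tuples and values are read off the output above; the exact construction (from_tuples, a single column labelled 0) is an assumption, since the real data is built differently.

```python
import numpy as np
import pandas as pd

# Toy MultiIndex matching the printed output (lev0, lev1, lev2).
index = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 2), (1, 3, 1)],
    names=["lev0", "lev1", "lev2"],
)
df = pd.DataFrame({0: [0.696469, np.nan, 0.719469, 0.980764, np.nan]}, index=index)

# Copy the first two levels into temporary columns, drop duplicates on them,
# and flag every row whose full index tuple survived.
first_seen = (
    df.assign(
        check=df.index.get_level_values(0),
        check2=df.index.get_level_values(1),
    )
    .drop_duplicates(subset=["check", "check2"])
    .index
)
df["keeper"] = df.index.isin(first_seen)

print(df)
```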
I tried reset_index, but in the end I need the MultiIndex to remain unchanged, and moving those levels to columns only to have to re-create the very large MultiIndex again afterwards seemed even less efficient than what I have.
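For comparison, here is a sketch of one possible alternative that never touches the df's own index. It uses MultiIndex.droplevel and MultiIndex.duplicated, neither of which is in my code above, and extra_levels is just an illustrative name. I have not benchmarked it on a large index, so treat it as a starting point rather than a confirmed improvement. One difference to be aware of: if the full index itself contains repeated tuples, the isin version marks all of them, while this one marks only the very first row.

```python
import numpy as np
import pandas as pd

# Same toy data as in the question.
index = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 2), (1, 3, 1)],
    names=["lev0", "lev1", "lev2"],
)
df = pd.DataFrame({0: [0.696469, np.nan, 0.719469, 0.980764, np.nan]}, index=index)

# Positions of every level beyond the first two (works for 3+ levels).
extra_levels = list(range(2, df.index.nlevels))

# droplevel returns a new, shorter index; df.index itself is left unchanged.
# duplicated(keep="first") is False for the first occurrence of each
# (lev0, lev1) pair, so negating it marks the keepers.
df["keeper"] = ~df.index.droplevel(extra_levels).duplicated(keep="first")

print(df)
```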