I have a df with a many-leveled MultiIndex. Early on I need to mark certain rows to keep; in subsequent sorting and processing these rows will always be kept.

I have working code, but it's not very attractive and I'm wondering if there's a prettier / more efficient way to do it.

Given a df with a 3+ level MultiIndex and an arbitrary number of columns, I run this code to check for duplicates in the first 2 levels of the MultiIndex, and mark the first occurrence as the keeper:

df['keeper'] = df.index.isin(
    df.assign(check=df.index.get_level_values(0),
              check2=df.index.get_level_values(1))
    .drop_duplicates(subset=['check', 'check2']).index)

Here's a toy df with resultant keeper col:

                       0  keeper
lev0 lev1 lev2                  
1    1    1     0.696469    True
          2          NaN   False
     2    3     0.719469    True
          2     0.980764   False
     3    1          NaN    True
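
For reproducibility, here's one way to build that toy frame (the values are hard-coded to match the printout above; the OP's original was presumably generated randomly):

import numpy as np
import pandas as pd

# toy frame: 3-level MultiIndex, one data column named 0
idx = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 2), (1, 3, 1)],
    names=['lev0', 'lev1', 'lev2'])
df = pd.DataFrame({0: [0.696469, np.nan, 0.719469, 0.980764, np.nan]},
                  index=idx)

Running the keeper line above on this frame produces the True/False column shown.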

I tried reset_index but in the end I need the MultiIndex to remain unchanged, and moving those levels to columns only to have to re-create the very large MultiIndex again afterwards seemed even less efficient than what I have.

  • Is this what you are looking for? https://stackoverflow.com/questions/71645191/pandas-multiindex-duplicated-only-for-particular-indices – rhug123 May 23 '23 at 01:48
  • @rhug123 Indeed that is related, thanks. I could make the `to_frame` answer work, although in the end my `get_level_values` solution is probably faster (although not tested), considering my MultiIndex is tens of levels deep. – fantabolous May 23 '23 at 02:07
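
For reference, the to_frame approach mentioned in that comment is presumably along these lines (it copies every index level into columns before checking for duplicates, which is the part that looked wasteful for a very deep MultiIndex):

df['keeper'] = ~df.index.to_frame().duplicated(subset=['lev0', 'lev1'])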

2 Answers

This should work as well. It is inspired by mozway's answer in the question linked in the comments above.

keep = ['lev0', 'lev1']

# drop every index level except those in `keep`, then flag the first occurrence of each pair
df['keeper'] = ~df.index.droplevel(df.index.names.difference(keep)).duplicated()

Edit by OP:

I timed the various solutions presented (and my original code) on a 1000x1000 df with a 10-level MultiIndex, and this solution (using droplevel and names.difference) was the clear winner, by over an order of magnitude. From fastest to slowest:

%timeit ~df.index.droplevel(df.index.names.difference(['lev0', 'lev1'])).duplicated()
72.1 µs ± 248 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit ~df[[]].reset_index(level=['lev0', 'lev1']).duplicated().to_numpy()
816 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit ~df.reset_index(level=['lev0', 'lev1']).duplicated(subset=['lev0', 'lev1']).to_numpy()
12.3 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit ~df.assign(check=df.index.get_level_values(0), check2=df.index.get_level_values(1)).duplicated(subset=['check', 'check2'])
12.7 ms ± 47.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.index.isin(df.assign(check=df.index.get_level_values('lev0'), check2=df.index.get_level_values('lev1')).drop_duplicates(subset=['check', 'check2']).index)
15.7 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
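
For context, a hypothetical reconstruction of the kind of frame used for these timings (only the shape and index depth matter here; the values and level cardinalities are made up):

import numpy as np
import pandas as pd

# 1000 rows x 1000 columns with a 10-level MultiIndex named lev0 ... lev9
rng = np.random.default_rng(0)
names = [f'lev{i}' for i in range(10)]
idx = pd.MultiIndex.from_arrays(
    [rng.integers(0, 10, size=1000) for _ in names], names=names)
df = pd.DataFrame(rng.random((1000, 1000)), index=idx)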

Thanks to @rhug123 and the answer he posted, I improved on my original code by using duplicated instead of drop_duplicates, which removes the need for .index.isin. I chose not to use to_frame because my MultiIndex is tens of levels deep and copying all of the levels just to use 2 seemed inefficient. Both rhug123's answer and the solutions below extract only the required index levels.

3 different options that worked, from slowest to fastest:

df['keeper'] = ~df.assign(check=df.index.get_level_values(0), check2=df.index.get_level_values(1)).duplicated(subset=['check', 'check2'])
df['keeper2'] = ~df.reset_index(level=['lev0', 'lev1']).duplicated(subset=['lev0', 'lev1']).to_numpy()
df['keeper3'] = ~df[[]].reset_index(level=['lev0', 'lev1']).duplicated().to_numpy()

After reset_index, .to_numpy() is required because the result's index no longer matches the original df. The third solution first drops all columns using df[[]], so no subset argument is needed.
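
To illustrate the alignment point on the toy frame from the question: after reset_index, the boolean mask is indexed only by the levels that were left in place, so assigning it back without .to_numpy() would try to align on that partial index instead of assigning positionally.

# on the toy frame, only 'lev2' remains in the index after reset_index
mask = ~df[[]].reset_index(level=['lev0', 'lev1']).duplicated()
print(mask.index.names)   # ['lev2']
print(df.index.names)     # ['lev0', 'lev1', 'lev2']

df['keeper3'] = mask.to_numpy()   # positional assignment avoids the mismatch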
