1

I am struggling with finding a way to sort the rows in a grouped pandas dataframe with two levels of indicies. This is an example of what a dataframe df I have:

enter image description here

I could not paste the table in a nice format, but here is code for creating the same:

import pandas as pd
data = {'Count': [207, 105, 28, 37, 182, 194]}
index = pd.MultiIndex.from_tuples([
    ('age_group', '20-39'),
    ('age_group', '40-59'),
    ('age_group', '<19'),
    ('age_group', '>60'),
    ('gender', 'female'),
    ('gender', 'male')
], names=['Parameter', 'Values'])
df = pd.DataFrame(data, index=index)

I would like to sort the level 1 age group based on this list: age_groups = ["<19", "20-39", "40-59", ">60"] in order to place the row with "<19" before "20-39".

I have a lot more parameters and values than shown here, but even with this simple example I cannot find a good solution.

df.reindex(age_groups, level=1) will remove the other Parameters. pd.MultiIndex.from_product() seems to require me to hard coded specify all other indicies which I will not do. All examples and documentation I find have the same level 1 indicies for all level 0.

3 Answers3

2

I'd probably go with @Andrej's approach if the categories are fixed.

If you need a programmatic solution, here is one with numpy.lexsort, it uses the last number of the range to perform the sort:

order = np.lexsort([df.index.get_level_values(1)
                      .str.extract(r'(\d+)$', expand=False)
                      .astype(float),
                    df.index.get_level_values(0)
                    ])

out = df.iloc[order]

Output:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
mozway
  • 194,879
  • 13
  • 39
  • 75
1

Kind of clunky, but I used a custom key with df.sort_index().

def sorter(vals):
    ages = ['<19', '20-39', '40-59', '>60']
    genders = ['female', 'male']
    if vals.name == 'Parameter':
        return sorted(vals)
    else:
        return [ages.index(val) if val in ages else genders.index(val) for val in vals]

df.sort_index(key = sorter)
Michael Cao
  • 2,278
  • 1
  • 1
  • 13
1

Another option is to simply select rows in correct order:

weights = ["<19", "20-39", "40-59", ">60"]

df = pd.concat([df.loc[("age_group", weights), :], df.loc[("gender",):]])
print(df)

Prints:

                  Count
Parameter Values       
age_group <19        28
          20-39     207
          40-59     105
          >60        37
gender    female    182
          male      194
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • This is a pretty nice solutions, however is comes with the warning: FutureWarning: The behavior of indexing on a MultiIndex with a nested sequence of labels is deprecated and will change in a future version. `series.loc[label, sequence]` will raise if any members of 'sequence' or not present in the index's second level. – confusedstudent Jul 18 '23 at 12:55
  • Which pandas version do you have? I'm using `pandas==1.5.3` (maybe it changed in pandas 2) – Andrej Kesely Jul 18 '23 at 13:55