6

I have a dataframe of categories and amounts. Categories can be nested into sub categories an infinite levels using a colon separated string. I wish to sort it by descending amount. But in hierarchical type fashion like shown.

How I need it sorted

CATEGORY                            AMOUNT
Transport                           5000
Transport : Car                     4900
Transport : Train                   100
Household                           1100
Household : Utilities               600
Household : Utilities : Water       400
Household : Utilities : Electric    200
Household : Cleaning                100
Household : Cleaning : Bathroom     75
Household : Cleaning : Kitchen      25
Household : Rent                    400
Living                              250
Living : Other                      150
Living : Food                       100

EDIT: The data frame:

pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})

Note: this is the order I want it. It may be in any arbitrary order before the sort.

EDIT2: If anyone looking for a similar solution I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure

Pierce
  • 101
  • 7

4 Answers4

5

One way could be to first str.split the category column.

df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
           0          1     2
0  Transport       None  None
1  Transport        Car  None
2  Transport      Train  None
3  Household       None  None
4  Household  Utilities  None

Then get the column amount and what you want is to get the maximum amount per group based on:

  • the first column alone,
  • then the first and the second columns
  • then the first-second and third columns, ...

You can do this with groupby.transform with max, and you concat each column created.

s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                  for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
       0       1      2
0   5000     NaN    NaN
1   5000  4900.0    NaN
2   5000   100.0    NaN
3   1100     NaN    NaN
4   1100   600.0    NaN
5   1100   600.0  400.0
6   1100   600.0  200.0
7   1100   100.0    NaN
8   1100   100.0   75.0
9   1100   100.0   25.0
10  1100   400.0    NaN
11   250     NaN    NaN
12   250   150.0    NaN
13   250   100.0    NaN

Now you just need to sort_values on all columns in the right order on first 0, then 1, then 2..., get the index and use loc to order df in the expected way

dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
                            category  amount
0                          Transport    5000
1                    Transport : Car    4900
2                  Transport : Train     100
3                          Household    1100
4              Household : Utilities     600
5      Household : Utilities : Water     400
6   Household : Utilities : Electric     200
10                  Household : Rent     400 #here is the one difference with this data
7               Household : Cleaning     100
8    Household : Cleaning : Bathroom      75
9     Household : Cleaning : Kitchen      25
11                            Living     250
12                    Living : Other     150
13                     Living : Food     100
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • 2
    This looks a lot more approachable, thanks! I assume this list comprehension: `[df_[col] for col in range(0, lv+1)]` has something to do with the specific types accepted by `groupby()`? (Also happy that you corrected the ordering by OP :) ) – Noah Jun 12 '20 at 20:16
  • 2
    @Noah so yeah, in the groupby you can either pass columns names or index levels if they are part of the dataframe or series you used it on. But here `s` does not have this. So what you can do then is using a list of same length iterable (in this case I use one column from df_ at a time) to see where the groups are. for lv=0, then it is equivalent to `s.groupby([df_[0]])` and for lv=1, then it is `s.groupby([df_[0], df_[1]])` if you don't have many sublevels it could be done manually, but the version with loop `for col ...` is more flexible – Ben.T Jun 12 '20 at 20:32
1

I packaged @Ben. T's answer into a more generic function, hopefully this is clearer to read!

EDIT: I have made changes to the function to group by columns in order rather than one by one to address potential issues noted by @Ben. T in the comments.

import pandas as pd

def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.

    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strips any white space before and after sep
    categories = categories.apply(lambda x: x.str.split().str[0], axis=1)
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by columns in order rather than one at a time
        level_by = [df_[col] for col in range(0, level+1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=False).index
    df = df.loc[ixs].copy()
    return df

Using Python 3.7.3, pandas 0.24.2

Nick C.
  • 21
  • 3
  • Hi this is interesting version. But the fact that you `groupby` on one column at a time won't give the desired result in case sub-categories are shared. Try your code by replacing `"Living : Food"` by `"Living : Car"` (I know it does not make sense with the categories here, but might happen in real case). You'll see that the last two rows won't be ordered as expected. Because the sub-category `car`, when `groupby` only on the this level only and `transform('max')`, will get the value 4900 from the category Transport, even if in the category Living it is associated to 100. hope it makes sense – Ben.T Jun 19 '20 at 12:57
  • 1
    Excellent point, thank you @Ben. T! I'm going to think about how to handle that. I hadn't noticed that yours didn't do it once column at a time. Nice job =) – Nick C. Jun 19 '20 at 13:35
  • it is what I tried to say with the bullet points in my answer, but it is hard to put words on it ;) – Ben.T Jun 19 '20 at 13:41
  • 1
    Yes, now I understand the bullet points! I have made the necessary changes to address the bug you noted. Thank you! – Nick C. Jun 19 '20 at 15:22
0

To answer my own question: I found a way. Kind of long winded but here it is.

import numpy as np
import pandas as pd


def sort_tree_df(df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted.drop(sort_key, axis=1, inplace=True)
    return df_sorted


sort_tree_df(df, 'category', 'amount')

Pierce
  • 101
  • 7
0

If you don't mind adding an extra column you can extract the main category from the category and then sort by amount/main category/category, ie.:

df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]

Output:

                            category  amount
0                          Transport    5000
1                    Transport : Car    4900
2                  Transport : Train     100
11                            Living     250
12                    Living : Other     150
13                     Living : Food     100
3                          Household    1100
4              Household : Utilities     600
5      Household : Utilities : Water     400
10                  Household : Rent     400
6   Household : Utilities : Electric     200
7               Household : Cleaning     100
8    Household : Cleaning : Bathroom      75
9     Household : Cleaning : Kitchen      25

Note that this will work well only if your main categories are single words without spaces. Otherwise you will need to do it in a different way, ie. extract all non-colons and strip the trailing space:

df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
mac13k
  • 2,423
  • 23
  • 34
  • 2
    This is not the order the OP and bounty's person want. By doing a descending lexical ordering on the main category, it is already not in the same order between Living and Household main category than in the question. Then there is all the order within each of the main categories that is not respected, you can see at the rows index 6 and 10, they should be exchange because `Household : Utilities : Electric` is a subcategory of `Household : Utilities` than has a bigger amount (600) than `Household : Rent` (400) and are at the same sub category level – Ben.T Jun 13 '20 at 19:42