Sort a pandas DataFrame which contains German umlauts [Solution]

Question

I have a pandas dataframe with several series and I sort this dataframe by some of them:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

This works fine! But German umlauts are not in the "German order": Ä is not between A and B, but is instead sorted after Z. Also, "denmark" is sorted after "Zimmermann", because the pandas sorting algorithm is case-sensitive.

I found solutions for sorting a dataframe by one Series (e.g. here) but no solution for sorting by several series. Umlauts are possible in all of the series. So I experimented a bit - maybe this helps someone. :-)

Solution:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

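str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann": once everything is lowercase, case no longer affects the order.

str.normalize('NFD') solves the problem with the German umlauts: it decomposes each umlaut into its base letter followed by a combining diaeresis (U+0308), so "ä" is compared as "a" plus a mark and is sorted after the plain "a" entries but before "b".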
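To see why these two steps work, here is a minimal sketch using only the standard library. The word list and the helper name sort_key are made up for illustration; the helper applies the same two steps (lowercase, then NFD) to a single string.

```python
import unicodedata


def sort_key(text: str) -> str:
    # Same two steps as the pandas key: lowercase, then decompose to NFD.
    return unicodedata.normalize("NFD", text.lower())


# "\u00c4rger" is "Ärger" written with the precomposed character Ä (U+00C4).
words = ["Zimmermann", "\u00c4rger", "denmark", "Bach", "Adler"]

# Default sort compares code points: uppercase < lowercase < Ä.
plain_sorted = sorted(words)
# -> ['Adler', 'Bach', 'Zimmermann', 'denmark', 'Ärger']

# With the key, "ärger" becomes "a" + U+0308 + "rger" and lands between A and B.
key_sorted = sorted(words, key=sort_key)
# -> ['Adler', 'Ärger', 'Bach', 'denmark', 'Zimmermann']

print(plain_sorted)
print(key_sorted)
```

Note that this is plain code-point comparison on the decomposed strings, not a full locale-aware collation: every word starting with "ä" ends up after every word starting with plain "a", but still before "b".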

This works, but if there are mixed numbers (as string) and words, it sorts the numbers incorrectly. For example, it sorts 100 before 20. How can this be fixed? — Mike, Dec 10 '21 at 13:56
Could you please post your 'solution' as an answer, and accept the answer? So people know it has been solved? — JeffUK, Dec 10 '21 at 13:59

score 1 · Accepted Answer · answered Dec 11 '21 at 14:22

Solution as an answer, so I can accept it as solved:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann".

str.normalize('NFD') solves the problem with the German umlauts.
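For anyone who wants to try it, here is a small self-contained sketch of the same call on a two-column frame. The sample data and the function name german_key are invented for the example, and it assumes every column listed in by holds strings (the .str accessor fails on numeric columns). pandas calls the key once per column, so each sort column is normalized independently.

```python
import pandas as pd

# Invented sample data; "\u00c4" is Ä and "\u00f6" is ö (precomposed forms).
df = pd.DataFrame({
    "Col1": ["Zimmermann", "\u00c4rger", "denmark", "Adler", "adler"],
    "Col2": ["b", "a", "c", "\u00f6", "o"],
})


def german_key(col: pd.Series) -> pd.Series:
    # Called once for each column in `by`: lowercase, then decompose umlauts.
    return col.str.lower().str.normalize("NFD")


result = df.sort_values(by=["Col1", "Col2"], key=german_key)

print(result)
# Col1 order: adler, Adler, Ärger, denmark, Zimmermann
# The two "adler" rows tie on Col1 and are ordered by Col2: "o" before "ö".
```

The key only affects the comparison; the values in the returned frame keep their original spelling and case.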
