I have a pandas dataframe with several series and I sort this dataframe by some of them:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

This works fine! But German umlauts are not in the "German order": Ä is not placed between A and B, but is sorted after Z instead. Also, "denmark" is sorted after "Zimmermann", because the pandas sorting algorithm seems to be case-sensitive.

I found solutions for sorting a dataframe by one Series (e.g. here), but no solution for sorting by several Series. Umlauts are possible in all of the Series. So I experimented a bit; maybe this helps someone. :-)


Solution:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann".

str.normalize('NFD') solves the problem with the German umlauts.
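To illustrate why the two steps help, here is a minimal sketch using only the standard library. The helper name sort_key and the sample words are made up for this illustration. NFD splits "ä" into a plain "a" followed by a combining diaeresis, so the word is compared by its base letter first:

```python
import unicodedata


def sort_key(text: str) -> str:
    """Lowercase the text, then decompose accented letters (NFD)."""
    return unicodedata.normalize("NFD", text.lower())


# Hypothetical sample data, chosen to show both problems.
words = ["Zimmermann", "denmark", "Ärger", "Bach", "Adam"]

# Plain code-point order: uppercase before lowercase, umlauts at the end.
print(sorted(words))

# With the key: case is ignored and "Ärger" lands between "Adam" and "Bach".
result = sorted(words, key=sort_key)
print(result)
```

Note that this is not a full German collation: the combining mark has a higher code point than any ASCII letter, so "Ärger" would still come after every word that starts with a plain "a" (for example "Azur").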

HFPSY
  • This works, but if there are mixed numbers (as string) and words, it sorts the numbers incorrectly. For example, it sorts 100 before 20. How can this be fixed? – Mike Dec 10 '21 at 13:56
  • Could you please post your 'solution' as an answer, and accept the answer? So people know it has been solved? – JeffUK Dec 10 '21 at 13:59

1 Answer

Solution as an answer, so I can accept it as solved:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

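str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann": after lowercasing, all values are compared without regard to case.

str.normalize('NFD') solves the problem with the German umlauts: it decomposes a letter such as "ä" into a plain "a" followed by a combining diaeresis, so the value is sorted by its base letter. The key function is applied to each column in by independently.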
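A small self-contained sketch of the approach, for anyone who wants to try it out. The two-column frame and its values are invented for this example, and it assumes pandas 1.1.0 or newer (where sort_values accepts key) and that every column in by holds strings, since .str fails on numeric columns:

```python
import pandas as pd

# Hypothetical sample data with mixed case and umlauts in both columns.
df = pd.DataFrame(
    {
        "Col1": ["Zimmermann", "denmark", "Ärger", "adam", "Adam"],
        "Col2": ["x", "y", "z", "Öl", "Ofen"],
    }
)

# The key is applied to each sort column separately:
# lowercase first, then decompose umlauts into base letter + combining mark.
result = df.sort_values(
    by=["Col1", "Col2"],
    key=lambda col: col.str.lower().str.normalize("NFD"),
)

print(result)
```

The two "adam" rows tie on Col1 once case is ignored, so Col2 decides their order, and "Ofen" comes before "Öl" because the combining mark sorts after the plain letters.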

HFPSY