3

I have a dataframe which I want to sort via sort_values on one column.

Problem is there are German umlaute as first letter of the words.

Like Österreich, Zürich.

Which will sort to Zürich, Österreich. It should be sorting Österreich, Zürich.

Ö should be between N and O.

I have found out how to do this with lists in python using locale and strxfrm. Can I do this in the pandas dataframe somehow directly?

Edit: Thank You. Stef example worked quite well, somehow I had Numbers where his Version did not work with my real life Dataframe example, so I used alexey's idea. I did the following, probably you can shorten this..:


df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b', 'v']})

#create index as column for joining later
df = df.reset_index(drop=False)

#convert int to str
df['location']=df['location'].astype(str)

#sort by location with umlaute
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)

#drop location so we dont have it in both tables
df = df.drop('location', axis=1)

#inner join on index
new_df = pd.merge(df_sort_index, df, how='inner', on='index')

#drop index as column
new_df = new_df.drop('index', axis=1)
joeagency
  • 33
  • 5
  • did you check the output of alexey's method? I get the umlaut points of Ö or ü not over the O or u but next to it over a space in IDLE. On the console, it makes a backspace to put them over the O or u, but the next column isn't aligned anymore. Firefox also shows them shifted (`Österreich` instead of `Österreich`) So you'll have to revert the normalization in the final result. – Stef Aug 06 '19 at 18:04
  • sorry I am not sure what you mean exactly, I Export my results to Excel and they seem fine? – joeagency Aug 09 '19 at 11:30

3 Answers3

3

You could use sorted with a locale aware sorting function (in my example, setlocale returned 'German_Germany.1252') to sort the column values. The tricky part is to sort all the other columns accordingly. A somewhat hacky solution would be to temporarily set the index to the column to be sorted and then reindex on the properly sorted index values and reset the index.

import functools
import locale
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern'],'code':['ö','z','b']})

df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()

Output of print(df):

     location code
0        Bern    b
1  Österreich    ö
2      Zürich    z


Update for mixed type columns If the column to be sorted is of mixed types (e.g. strings and integers), then you have two possibilities:

a) convert the column to string and then sort as written above (result column will be all strings):

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location=df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345' 'Bern' 'Österreich' 'Zürich']

b) sort on a copy of the column converted to string (result column will retain mixed types)

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']
Stef
  • 28,728
  • 2
  • 24
  • 52
  • @joeagency I updated my answer for mixed type columns – Stef Aug 06 '19 at 17:54
  • Hey, this works for the example thx. Since in my real world data I have duplicates I would have to make a new DF without Duplicates so I can use them as Index and Sort and then Join back etc... Would also work both Solutions work for my Data at least. – joeagency Aug 09 '19 at 11:42
1

you can use unicode NFD normal form

>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3              N
1            Ost
0    Österreich
2              S
dtype: object

# use result to rearrange a dataframe
>>> df[names.str.normalize('NFD').sort_values().index]

It's not quite what you wanted, but for proper ordering you need language knowladge (like locale you mentioned).

NFD employs two symbols for umlauts e.g. Ö becomes O\xcc\x88 (you can see the difference with names.str.normalize('NFD').encode('utf-8'))

alexey
  • 706
  • 5
  • 9
  • I think the OP asked to sort **a whole dataframe** by the value of one column, not just the values of this column. Apart from that +1 for `NFD` and `str.normalize`, didn't know it before. – Stef Aug 05 '19 at 17:00
  • you can use the resulting index to rearrange the dataframe – alexey Aug 05 '19 at 17:15
0

Sort with locale:

import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'de_de')
#codes: https://github.com/python/cpython/blob/3.10/Lib/locale.py

#create df
df = pd.DataFrame({'location': ['Zürich','Österreich','Bern', 254345],'code':['z','ö','b','v']})
#convert int to str
df['location']=df['location'].astype(str)
#sort
df_ord = df.sort_values(by = 'location', key = lambda col: col.map(lambda x: locale.strxfrm(x)))

Multisort with locale:

import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'es_es')

# create df
lista1 = ['sarmiento', 'ñ', 'á', 'sánchez', 'a', 'ó', 's', 'ñ', 'á', 'sánchez']
lista2 = [10, 20, 60, 40, 20, 20, 10, 5, 30, 20]
df = pd.DataFrame(list(zip(lista1, lista2)), columns = ['Col1', 'Col2'])

#sort by Col2 and Col1
df_temp = df.sort_values(by = 'Col2')
df_ord = df_temp.sort_values(by = 'Col1', key = lambda col: col.map(lambda x: locale.strxfrm(x)), kind = 'mergesort')
plrp
  • 1
  • 1