I have a bunch of CSV files that I read and plot with python and pandas.

To add some more information about the file (or rather, the data it is about) into my plots, I am analyzing their headers, to extract various things from it (location of the measurement point, type of measurement etc.).

Problem is - the files are in German and thus contain a lot of umlauts (ü, ö, ä). Now I can read and understand them perfectly fine, but my script can't.

So I want to simply replace them with their valid two-character representations (ü=ue, …), so that I don't have to worry about using things like u'Ümlautstring' or \xfcstring in python.
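For illustration, the replacement the question describes can be sketched in Python itself (the mapping table and helper name here are hypothetical, not from the thread):

```python
# German umlauts and their conventional two-character ASCII forms.
UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def transliterate(text):
    """Replace German umlauts with their two-character ASCII forms."""
    for umlaut, ascii_form in UMLAUT_MAP.items():
        text = text.replace(umlaut, ascii_form)
    return text

print(transliterate("Ümlautstring"))  # Uemlautstring
```

This only works once the text has been decoded correctly, which is exactly the encoding problem the question runs into below.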

sed -i 's/\ä/ae/g' myfile.csv

should do the trick, according to Google, but it doesn't work.

With some further research, I found the issue, but no solution:

My CSV files are encoded in ISO 8859-15, but my locale is LANG=de_DE.UTF-8, which, as far as I understand it, means that sed searches for ü in its UTF-8 form, which it will not find in ISO 8859-15 files.

So what do I have to tell sed to find my umlauts?

Most things I have found so far suggest Perl, but that is not really an option.

JC_CL
  • The proper solution is to decode the 8859-15 and normalize the resulting Unicode to a suitable format. `import unicodedata; decoded_normalized = unicodedata.normalize('NFKD', open(file, 'r').read().decode('iso-8859-15'))` – tripleee Feb 19 '15 at 10:55

1 Answer

You can use the LC_* environment variables to prevent sed from doing any UTF-8 interpretation, and \x escape sequences to specify the umlaut characters by their hex values in ISO-8859-15. Long story short,

LC_ALL=C sed 's/\xc4/Ae/g;s/\xd6/Oe/g;s/\xdc/Ue/g;s/\xe4/ae/g;s/\xf6/oe/g;s/\xfc/ue/g;s/\xdf/ss/g' filename

should work for all of ÄÖÜäöüß, which I'm guessing are the ones you care about.
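For reference, the hex values in the sed command are the ISO-8859-15 code points of those characters, and the same byte-level replacement can be sketched in Python (the table and function name here are illustrative, not part of the answer):

```python
# The hex escapes in the sed command correspond to the ISO-8859-15
# encoding of the German umlauts and ß:
assert "ÄÖÜäöüß".encode("iso-8859-15") == b"\xc4\xd6\xdc\xe4\xf6\xfc\xdf"

# Byte-for-byte replacements, mirroring the sed substitutions.
REPLACEMENTS = {
    b"\xc4": b"Ae", b"\xd6": b"Oe", b"\xdc": b"Ue",
    b"\xe4": b"ae", b"\xf6": b"oe", b"\xfc": b"ue",
    b"\xdf": b"ss",
}

def replace_umlaut_bytes(data):
    """Apply the same substitutions as the sed command, on raw bytes."""
    for old, new in REPLACEMENTS.items():
        data = data.replace(old, new)
    return data

print(replace_umlaut_bytes("Müller".encode("iso-8859-15")))  # b'Mueller'
```

Like the sed command, this operates on raw bytes, so it is specific to ISO-8859-15 input.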

Wintermute
  • ... until somebody measures something in Angoulême or São Paulo. – tripleee Feb 19 '15 at 10:58
  • Thank you! That works. But as far as I understand it, that is ONLY going to work for ISO 8859-15 files, and UTF-8 will fail? I guess it would be wise to also check the file's encoding in my python script, so that I can switch to a UTF-8 fallback (or rather, fallup?) should I have such a file? – JC_CL Feb 19 '15 at 10:59
  • Or you could `iconv` everything into UTF-8 before processing. – tripleee Feb 19 '15 at 11:00
  • @JC_CL You should not let this run on a UTF-8 file, that is correct. If you're unsure of the encoding your file has, use a tool like `recode` to bring it into a known encoding and work with that. (`iconv` requires you to know the source encoding, whereas `recode` guesses it for you). And if you do that anyway, you can forego the special iso8859-15 handling, although you'll still have to know the code points you want to replace and with what to replace them. Or you could make the script work with UTF-8 data. – Wintermute Feb 19 '15 at 11:08
  • @tripleee Good point! Currently I won't need it, but I'll keep it in mind when I work with international data. Wouldn't it just be a matter of adding `s/\xe3/a/g;` to the sed line? If I understand it correctly, it's just `\x` + `HEX` from [this list](http://www.pjb.com.au/comp/diacritics.html). – JC_CL Feb 19 '15 at 11:09