
I used a scraper to get comments from Facebook. Unfortunately, it converted the German umlauts "Ä", "Ü" and "Ö" to UTF-8 literals such as "\xc3\xb6". I have tried different approaches to convert the files back, but unfortunately none of them were successful.

import csv, glob

for file in glob.glob("Comments/*.csv"):
    rawfile=csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
    new_tablename=file +"converted"
    new_table=csv.writer(open("%s.csv" % (new_tablename),"w"))
    for row in rawfile:
        for w in row:
            a=str(w)
            b=a.encode('latin-1').decode('utf-8')
            print(b)
        new_table.writerow(row)

Another approach was creating a dictionary mapping the literals to the German characters, but this approach did not work either.

import csv, glob, re
print("Start")
converter_table=csv.reader(open("LiteralConvert.csv","rU"))
converterdic={}
for line in converter_table:
    charToFind=line[2]
    charForReplace=line[1]
    print(charToFind+" will be replaced by "+charForReplace)
    converterdic[charToFind] = charForReplace


print(converterdic)

for file in glob.glob("Comments/*.csv"):
    rawfile=csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
    print("opening: "+ file)
    new_tablename=file +"converted"
    new_table=csv.writer(open("%s.csv" % (new_tablename),"w"))
    print("created clean file: " + new_tablename)
    for row in rawfile:
        for w in row:
            #print(w)
            try:
                w.translate(converterdic)
            except KeyError:
                continue
        new_table.writerow(row)

However, the first solution works fine if I just do:

s="N\xc3\xb6 kein Schnee von gestern doch der beweis daf\xc3\xbcr das L\xc3\xbcgenpresse existiert."
b = s.encode('latin-1').decode('utf-8')

print(b)

But it does not work when I read the string in from a file.

Haluka Maier-Borst
    If you have UTF-8 data, why would you read it as ISO-8859-1? – Josh Lee Apr 18 '17 at 12:59
  • I am sorry, I saw it somewhere else on Stack Overflow and gave it a try. But regardless of whether I set it to UTF-8 or ISO-8859-1, the core problem is that it does not replace the wrong parts such as "\xc3\xb6" with what I have stored in the dictionary. – Haluka Maier-Borst Apr 18 '17 at 13:24
  • Please check whether my current answer nails your "problem" down to its core - I think it does and is therefore worth accepting as the answer. – Claudio Apr 18 '17 at 23:16
  • @HalukaMaier-Borst I see an upvote on my answer, was it you? – Claudio Apr 19 '17 at 12:32
  • Please accept an answer to give the question/answer cycle a defined end and to give others the chance to see that the question was answered. – Claudio Apr 28 '17 at 07:27

2 Answers


I have been through all the comments and the other answer trying to understand WHERE the problem is and WHAT the core of the problem you face is. Here is my conclusion from all this, after much thought about it:

A frequent core of problems with encoding/decoding strings is inferring what you have from what you see. In this context it is VERY IMPORTANT to understand that:

If you have a string/text in Python (or in a file), you are never, ever able to see it 'as it is'

and you always have to decide on an encoding/decoding scheme first.

In other words, you ALWAYS look through the filter of a given encoding/decoding at whatever you are looking at, and if the encoding/decoding scheme changes, what you see changes without any change in what you are looking at.

Let's say the same once again, now in yet other words: to look at a string or at text in a file you MUST use some kind of tool for its VISUALIZATION ... AND ... such a visualization tool USES some kind of information about the ENCODING (either implicitly, taking a default value, or explicitly, by urging you to specify which encoding it should use), so without encoding/decoding there is no visualization. Understanding this has a huge impact on how you think about what you see in terms of what you are actually looking at. It is like 3D glasses in a cinema: wearing them does not change what is on the screen, but it changes how you see it.

So if you have a UTF-8 encoded string with non-ASCII characters and look at it with a tool that shows UTF-8 characters, you see the German umlauts as they are. BUT if you look at the same string using a tool for visualization of binary strings, it will show you neither the non-ASCII characters in it (it's binary, so it visualizes byte by byte and can't show non-ASCII characters without knowledge of the code used) nor the UTF-8 interpretation (an umlaut is two bytes, but the tool for visualization shows byte by byte) - it will show you the non-ASCII characters in the form "\xc3\xb6", BUT ... in the string/file there are NOT eight characters there - there are only TWO bytes, 0xC3 and 0xB6. This is how it comes about that e.g. the print() command, in order to show you what the bytes are, uses "\xc3\xb6".
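
To make this concrete, here is a minimal sketch in plain Python (the two bytes are the ones from the question) showing how the very same bytes appear differently depending on which "glasses" are used to look at them:

raw = b'\xc3\xb6'             # two bytes: the UTF-8 encoding of 'ö'

print(len(raw))               # 2  -> two bytes, not eight characters
print(raw)                    # b'\xc3\xb6' -> escape codes, because bytes cannot be shown "as they are"
print(raw.decode('utf-8'))    # ö  -> the same two bytes seen through the UTF-8 "glasses"
print(raw.decode('latin-1'))  # Ã¶ -> the same two bytes seen through the Latin-1 "glasses"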

I hope you now get the idea of what I am talking about (it's a kind of enlightenment experience after long hours/days/months of confusion).

Here is an excerpt from the UTF-8 table in which you can find the letter 'ö' (code point, character, UTF-8 bytes, and how those bytes appear when misread as Latin-1):

U+00F6    ö    c3 b6    Ã¶    LATIN SMALL LETTER O WITH DIAERESIS

Claudio

You are essentially doing b'\xc3\xb6'.decode('ISO-8859-1').encode('latin-1').decode('utf8') when you do

rawfile = csv.reader(open(file,"rU", encoding = "ISO-8859-1"))
...
a = str(w)
b = a.encode('latin-1').decode('utf-8')

Skip the unnecessary .decode() and .encode() by opening the files with open(file, "r", encoding="utf8") instead.
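
A minimal sketch of what the corrected loop could look like, assuming the same Comments/*.csv layout and converted-file naming as in the question:

import csv, glob

for file in glob.glob("Comments/*.csv"):
    # Read the scraped CSV as UTF-8 directly; no re-encode/decode round trip is needed.
    with open(file, "r", encoding="utf8", newline="") as src, \
         open(file + "converted.csv", "w", encoding="utf8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)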

Martin Valgur
  • Dear Martin, thanks for the answer and you were right, that step was unnecessary. I just changed the settings and removed the "decode.encode" part. But still, unfortunately, the problem keeps popping up. So sentences still look like: b'Klar Karin, weil wir auch alle die Folgen bei 21.000 Regelungen die es jetzt zu besprechen gilt absehen k\xc3\xb6nnen. Norbert hat eigentlich v\xc3\xb6llig Recht. – Haluka Maier-Borst Apr 18 '17 at 15:45
  • Could you upload the text file that is causing you problems anywhere? I'm not sure I could help you further otherwise. – Martin Valgur Apr 18 '17 at 15:46
  • @HalukaMaier-Borst The example you show is a `bytes` object (it's got a `b` prefix). If you decode this with UTF-8, you get a correct string. Confirm: `b'k\xc3\xb6nnen'.decode('utf8') == 'können'` – lenz Apr 18 '17 at 19:11