I used a scraper to get comments from Facebook. Unfortunately, it converted the German umlauts "Ä", "Ü", "Ö" to UTF-8 byte literals such as "\xc3\xb6". I have now tried different approaches to convert the files back, but unfortunately none of them was successful.
import csv, glob

for file in glob.glob("Comments/*.csv"):
    # "rU" mode is deprecated; the csv module expects newline=""
    rawfile = csv.reader(open(file, "r", newline="", encoding="ISO-8859-1"))
    new_tablename = file + "converted"
    new_table = csv.writer(open("%s.csv" % new_tablename, "w", newline="", encoding="utf-8"))
    for row in rawfile:
        # decode every cell and write the decoded values, not the original row
        fixed_row = [w.encode("latin-1").decode("utf-8") for w in row]
        print(fixed_row)
        new_table.writerow(fixed_row)
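If the CSVs on disk really do contain raw UTF-8 bytes, the encode/decode round trip is unnecessary in the first place: opening the file with the right encoding reads the umlauts directly. A minimal sketch, using a hypothetical sample file written just for the demonstration:

```python
import csv, os, tempfile

# Write a small UTF-8 CSV to simulate a scraped file (hypothetical data).
path = os.path.join(tempfile.gettempdir(), "comment_sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(["Nö kein Schnee von gestern"])

# Reading with encoding="utf-8" yields the umlauts as-is,
# with no latin-1/utf-8 round trip needed.
with open(path, "r", encoding="utf-8", newline="") as f:
    row = next(csv.reader(f))
print(row[0])  # Nö kein Schnee von gestern
```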
Another approach was to build a dictionary mapping the literals to the German characters, but that did not work either.
import csv, glob

print("Start")
converter_table = csv.reader(open("LiteralConvert.csv", "r", newline=""))
converterdic = {}
for line in converter_table:
    charToFind = line[2]
    charForReplace = line[1]
    print(charToFind + " will be replaced by " + charForReplace)
    converterdic[charToFind] = charForReplace
print(converterdic)

for file in glob.glob("Comments/*.csv"):
    rawfile = csv.reader(open(file, "r", newline="", encoding="ISO-8859-1"))
    print("opening: " + file)
    new_tablename = file + "converted"
    new_table = csv.writer(open("%s.csv" % new_tablename, "w", newline="", encoding="utf-8"))
    print("created clean file: " + new_tablename)
    for row in rawfile:
        clean_row = []
        for w in row:
            # str.translate expects a code-point table and cannot replace
            # multi-character sequences like "\xc3\xb6"; it also returns a
            # new string, so the original call discarded its result.
            # Plain str.replace handles multi-character literals.
            for literal, char in converterdic.items():
                w = w.replace(literal, char)
            clean_row.append(w)
        new_table.writerow(clean_row)
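For single-character substitutions, `str.translate` does work, but only with a table built by `str.maketrans`; passing a plain dict of strings silently matches nothing, because `translate` looks characters up by code point. A minimal sketch with illustrative replacements:

```python
# str.translate only replaces single characters, via a str.maketrans table.
# Note that it returns a new string rather than modifying in place.
table = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
text = "Lügenpresse"
print(text.translate(table))  # Luegenpresse
```

Multi-character keys such as the literal `"\xc3\xb6"` sequences are not supported by `maketrans` at all, which is why the loop above falls back to `str.replace`.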
However, the first approach works fine if I just do:
s="N\xc3\xb6 kein Schnee von gestern doch der beweis daf\xc3\xbcr das L\xc3\xbcgenpresse existiert."
b = s.encode('latin-1').decode('utf-8')
print(b)
But not when the string is read in from a file.
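One possible explanation, assuming the scraper wrote the escape sequences as literal text: in the Python source above, `"\xc3\xb6"` is two characters, but in a file it would be eight characters (backslash, `x`, `c`, `3`, …). In that case the sequences must first be interpreted as escapes, e.g. via the `unicode_escape` codec, before the latin-1/utf-8 round trip. A sketch of that hypothesis:

```python
# Raw string simulates what the file would contain: literal backslash text.
s = r"N\xc3\xb6 kein Schnee von gestern"
# Step 1: interpret \xc3 etc. as escape codes -> mojibake string "NÃ¶ ..."
step1 = s.encode("latin-1").decode("unicode_escape")
# Step 2: undo the mojibake with the usual round trip
step2 = step1.encode("latin-1").decode("utf-8")
print(step2)  # Nö kein Schnee von gestern
```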