UTF-8 conversion to real letter

Question

I need help with one of my projects. I'm cleaning a large set of data to bulk insert into microsoft SQL. The data is like 10million lines. But I created a script just to extract the first 1000 for cleaning assuming the rest are the same. I noticed there was a lot of UTF-8 characters, so I converted it to the nearest real character. But after I extracted it to view the first 100000 lines, I noticed there is way more UTF-8 conversions that needs to be done and i'm converting them manually which is pretty exhaustive. I was wondering if there was an easier way to do this rather than manually entering everything in. Here is my code:

import re

infile = r"C:\\Users\\Dave\\Desktop\\database\\page-links_en.txt"
outfile = r"C:\\Users\\Dave\\Desktop\\database\\Complete\\cleanedpagelinks_file.txt"

fin = open(infile)
fout = open(outfile, "w+")

rex = re.compile(r'/([^/>]+)>')

for line in fin:
#for word in delete_list:
#    line = line.replace(word, "")
line = line.replace("%C3%A9","e")
line = line.replace("%C3%B3","o")
line = line.replace("%E2%80%93","-")
line = line.replace("%C3%A6","e")
line = line.replace("%C3%A8","e")
line = line.replace("_"," ")
line = line.replace("%C3%A0","e")
line = line.replace("%C3%A1","i")
line = line.replace("%C5%82","l")
line = line.replace("%C5%84","n")
line = line.replace("%C3%BF", "y")
line = line.replace("%C3%BE", "p")
line = line.replace("%C3%BD", "y")
line = line.replace("%C3%BC", "u")
line = line.replace("%C3%BB", "u")
line = line.replace("%C3%BA", "u")
line = line.replace("%C3%B9", "o")
line = line.replace("%C3%B6", "o")
line = line.replace("%C3%B5", "o")
line = line.replace("%C3%B4", "o")
line = line.replace("%C3%B3", "o")
line = line.replace("%C3%B2", "o")
line = line.replace("%C3%B1", "n")
line = line.replace("%C3%B0", "e")
line = line.replace("%C3%AC", "i")
line = line.replace("%C3%AD", "i")
line = line.replace("%C3%AE", "i")
line = line.replace("%C3%AF", "i")
line = line.replace("%C3%81","A")
line = line.replace("%C3%82","A")
line = line.replace("%C3%83","A")
line = line.replace("%C3%84","A")
line = line.replace("%C3%85","A")
line = line.replace("%C3%86","AE")
line = line.replace("%C3%87","C")
line = line.replace("%C3%88","E")
line = line.replace("%C3%89","E")
line = line.replace("%C3%8A","E")
line = line.replace("%C3%8B","E")
line = line.replace("%C3%8C","I")
line = line.replace("%C3%8D","I")
line = line.replace("%C3%8E","I")
line = line.replace("%C3%8F","I")
line = line.replace("%C3%90","D")
line = line.replace("%C3%91","N")
line = line.replace("%C3%92","O")
line = line.replace("%C3%93","O")
line = line.replace("%C3%94","O")
line = line.replace("%C3%95","O")
line = line.replace("%C3%96","O")
line = line.replace("%C3%98","O")
line = line.replace("%C3%99","U")
line = line.replace("%C3%9A","U")
line = line.replace("%C3%9B","U")
line = line.replace("%C3%9C","U")
line = line.replace("%C3%9D","Y")
line = line.replace("%C3%9F","B")
line = line.replace("%C3%a0","a")
line = line.replace("%C3%a1","a")
line = line.replace("%C3%a2","a")
line = line.replace("%C3%a3","a")
line = line.replace("%C3%a4","a")
line = line.replace("%C3%a5","a")
line = line.replace("%C3%a6","ae")
line = line.replace("%C3%a7","c")
line = line.replace("%C3%a8","e")
line = line.replace("%C3%a9","e")
line = line.replace("%C3%aa","e")
line = line.replace("%C3%ab","e")


match = rex.search(line)
if match:
    newline = match.group(1)
else: newline = ''
fout.write(newline + '\n')
fin.close()
fout.close()

As you can see in my code I'm manually replacing to the real character value. Here's an example line in my text file that I realized needs to be converted.

B%E1%BA%A3o %C4%90%E1%BA%A1i

can you give the part of file for test? or try to use the line.decode('utf-8').encode('iso-8859-1') — Konstantin Purtov, Mar 05 '16 at 18:34
What is the encoding of the input and what should the encoding of the output be? — pp_, Mar 05 '16 at 18:41
I tried using line.decode('utf-8').encode('iso-8859-1') , but it says AttributeError: 'str' object has no attribute 'decode' Also here is a sample of some lines I just pulled out with the unicodes Gotterd%C3%A4mmerung Gurmukh%C4%AB alphabet Hez%C3%A2rfen Ahmed Celebi Hrad%C4%8Dany Thich Nh%E1%BA%A5t H%E1%BA%A1nh La%E1%B9%85k%C4%81vat%C4%81ra S%C5%ABtra Hu%E1%BA%BF http://pastebin.com/wckCENCc I posted a portion of it on pastebin, lines 89, 132 and 153 — DavidA, Mar 05 '16 at 18:53
What I mean by real letter is for example c3 bf is' ÿ', but I just put 'y' @schwobaseggl — DavidA, Mar 05 '16 at 18:56
This is the input: Gotterd%C3%A4mmerung and the output should be: Gotterd ämmerung, @pp_ — DavidA, Mar 05 '16 at 19:15
Also your pathnames are wrong - you should *either* use `r''` strings *or* double \\, not both. — Antti Haapala -- Слава Україні, Mar 05 '16 at 19:47

score 3 · Answer 1 · answered Mar 05 '16 at 19:43

You could use unidecode with urllib.parse.unquote :

In [8]: from unidecode import  unidecode

In [9]: from urllib.parse import unquote

In [10]: unidecode(unquote("Gotterd%C3%A4mmerung"))
Out[10]: 'Gotterdammerung'

unidecode will convert the non ascii characters to their ascii equivalent.

score 2 · Answer 2 · answered Mar 05 '16 at 19:44

You can use urllib.parse.unquote. It assumes UTF-8 by default, but in case there are also urls from other codecs among there, you can use some autodetection:

from urllib.parse import unquote

def cleanup(url):
    try:
        return unquote(url, errors='strict')
    except UnicodeDecodeError:
        return unquote(url, encoding='latin-1')

and B%E1%BA%A3o %C4%90%E1%BA%A1i was the last emperor of Vietnam:

>>> cleanup('B%E1%BA%A3o %C4%90%E1%BA%A1i')
'Bảo Đại'

If you want to convert these to the ASCII equivalents, you can use the unidecode:

>>> unidecode.unidecode('Bảo Đại')
'Bao Dai'

score 0 · Accepted Answer · answered Mar 05 '16 at 21:31

Thank you everyone, I ended up getting it to work. I had to install the unidecode module which took me forever to figure out cause I was running into pip and cmd prompt errors. After the package installed I added this line and it worked.

line = cleanup(line)
line = unidecode(line)

I really appreciate the help!

score -1 · Answer 4 · edited Mar 05 '16 at 20:19

-1

As far as my understanding goes, this is URL encoded i.e encoding the characters so that you can pass as parameter to server.

Use unquote_plus() from urllib:

s1 = u'B%E1%BA%A3o %E1%BA%A1i'
print urllib.unquote_plus(s1)

Output:

Báº£o áº¡i

edited Mar 05 '16 at 20:19

dur

15,689
25
79
125

answered Mar 05 '16 at 19:21

Varun Joshi

1
2

1

I tried this, got invalid syntax error, tried other methods too and still same error @VarunJoshi – DavidA Mar 05 '16 at 19:42

UTF-8 conversion to real letter

4 Answers4