LXML ValueError and UTF strings

Question

I am writing a small Python script for mass-editing HTML files (replacing links to images, etc.). The HTML files contain some Cyrillic, so I encode the string as UTF-8. I replace all the links in the HTML, call tag.set('value', value), and the console displays:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

How can I fix this? I'm pretty sure that there aren't any control characters or NULL bytes. I'm using Python 2.7.11.

value = tag.get('value').encode('utf-8')
#h = HTMLParser.HTMLParser()
#value = h.unescape(value)
urls = regex.finditer(value)
if urls is None: continue
for turl in urls:
    ufile = turl.group().rsplit('/', 1)[-1]
    value = value.replace(turl.group(), '/'+newsrc+'/'+ufile)
    #value = cgi.escape(value, True)
    value = value.replace('\0', '')
tag.set('value', value)

Also, I am unsure whether I need to escape and unescape the HTML. The person who wrote it left it quite messy. — McLinux, Jan 10 '16 at 14:33
Please do not add requests for urgency to any of your posts here, or anywhere on the web where your readership is mostly composed of volunteers. This sort of demanding tone is likely to turn off people from answering you - I removed it deliberately for this reason. I have downvoted as a reminder. Please do not add it again - this is called "edit warring" and will usually result in a moderator flag here. — halfer, Jan 10 '16 at 16:34

score 0 · Answer 1 · answered Jan 10 '16 at 16:54

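It's easy: remove the encode('utf-8') call. On Python 2, encode() turns the unicode string into a byte string, and lxml only accepts byte strings that are plain ASCII, so the UTF-8 bytes of the Cyrillic text trigger the ValueError. Pass the unicode value through unchanged and leave it to lxml to convert the text into a suitable encoding when it serializes the document.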
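A minimal sketch of the loop body without the encode step might look like the following. The pattern URL_RE and the helper names rewrite_links and update_tag are hypothetical (the question does not show its real regex), and tag is assumed to be an lxml element, although any object with get and set would work.

```python
import re

# Hypothetical pattern; the question does not show what `regex` really is.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def rewrite_links(value, newsrc):
    """Point every URL in `value` at /<newsrc>/<file name>.

    The text is never encoded, so Cyrillic characters pass through as-is.
    """
    def repl(match):
        # Keep only the part after the last slash, as the question does.
        ufile = match.group().rsplit('/', 1)[-1]
        return '/' + newsrc + '/' + ufile
    return URL_RE.sub(repl, value)


def update_tag(tag, newsrc):
    """Rewrite the 'value' attribute of an element in place."""
    value = tag.get('value')   # already text; no .encode('utf-8') here
    if value is None:          # attribute missing, nothing to rewrite
        return
    tag.set('value', rewrite_links(value, newsrc))
```

Using re.sub with a callback replaces each match once, which avoids the repeated value.replace calls, and the check for a missing attribute takes the place of the urls is None test, since finditer always returns an iterator.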

McLinux
