
I am writing a small Python script for mass-editing HTML files (replacing links to images, etc.). The HTML files contain some Cyrillic, so I encode the string as UTF-8. I replace all the links in the HTML, call tag.set('value', value), and BOOM, the console displays:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

How can I fix this? I'm pretty sure that there aren't any control characters or NULL bytes. I'm using Python 2.7.11.

    value = tag.get('value').encode('utf-8')
    #h = HTMLParser.HTMLParser()
    #value = h.unescape(value)
    urls = regex.finditer(value)
    if urls is None: continue
    for turl in urls:
        ufile = turl.group().rsplit('/', 1)[-1]
        value = value.replace(turl.group(), '/'+newsrc+'/'+ufile)
        #value = cgi.escape(value, True)
        value = value.replace('\0', '')
    tag.set('value', value)
halfer
McLinux
  • please provide some sample code of what you've tried so far – Gerard Rozsavolgyi Jan 10 '16 at 14:24
  • Added the code, Gerard – McLinux Jan 10 '16 at 14:32
  • Also, I am unsure about whether I need to escape and unescape the html code, the guy that wrote the html code made it messy – McLinux Jan 10 '16 at 14:33
  • Please do not add requests for urgency to any of your posts here, or anywhere on the web where your readership is mostly composed of volunteers. This sort of demanding tone is likely to turn off people from answering you - I removed it deliberately for this reason. I have downvoted as a reminder. Please do not add it again - this is called "edit warring" and will usually result in a moderator flag here. – halfer Jan 10 '16 at 16:34
  • It would help to see sample data which triggers the error. – tripleee Jan 10 '16 at 16:43
  • Thank you for the advice halfer! :) – McLinux Jan 10 '16 at 16:51

1 Answer


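It's easy: just remove the .encode('utf-8') call. lxml expects text as Unicode strings (plain ASCII byte strings are also accepted). In Python 2, a UTF-8 encoded byte string that contains Cyrillic is neither of those, which is why tag.set() raises the ValueError even though there are no NULL bytes or control characters. Keep the value as unicode, leave it to lxml to convert the text into a suitable encoding when the document is written out, and everything will be fine. :)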
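
For illustration, here is a minimal sketch of the replacement step working on the Unicode string directly, with no encode call anywhere. The URL pattern, the function name rewrite_image_urls, and the sample value are assumptions made up for this example; the question does not show the real regex or data. It uses only the standard library, so the lxml call itself is left out.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import re

# Hypothetical pattern: the question does not show what `regex` really is.
IMAGE_URL = re.compile(r"""https?://[^\s"'<>]+\.(?:png|jpe?g|gif)""")


def rewrite_image_urls(value, newsrc, pattern=IMAGE_URL):
    """Point every matched URL at /<newsrc>/<filename>, staying in Unicode."""
    def relocate(match):
        # Keep only the file name, as the loop in the question does.
        filename = match.group().rsplit('/', 1)[-1]
        return '/' + newsrc + '/' + filename

    # No .encode('utf-8') here: input and output are both text strings.
    return pattern.sub(relocate, value)


# Made-up sample value containing Cyrillic text and an image link.
value = 'Фото: <img src="http://example.com/pics/кот.png">'
new_value = rewrite_image_urls(value, 'images')
```

In the script from the question, the result would then go straight into tag.set('value', new_value), still as a Unicode string.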

McLinux