I am writing a Python script for mass replacement of links (actually image and script sources) in HTML files, using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):

<input class="question_data" value="{&quot;text&quot;:&quot;&lt;p&gt;[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.&lt;/p&gt;&quot;,&quot;fields&quot;:[{&quot;id&quot;:&quot;1&quot;,&quot;type&quot;:&quot;fill&quot;,&quot;element&quot;:{&quot;sirina&quot;:&quot;103&quot;,&quot;maxDuzina&quot;:&quot;12&quot;,&quot;odgovor&quot;:[&quot;Информатика&quot;]}}]}" name="question:1:data" id="id3a1"/>

When I try to print out this data in Python using:

print "OLD_DATA:", data

It just prints the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the image links in the value attribute of the input, but I can't change them if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)

McLinux

1 Answer

You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
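
As a rough illustration of the log-to-a-file idea, here is a minimal sketch. The helper name `log_to_file` and the file name used below are made up for the example, and it assumes the value you are logging is a unicode string (which is what lxml normally hands back for attribute values containing non-ASCII text). It uses `io.open`, which accepts an explicit encoding on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io


def log_to_file(path, label, text):
    """Append one labelled line to a UTF-8 encoded log file.

    Hypothetical helper for illustration; `text` is assumed to be unicode.
    """
    # The encoding is given explicitly, so the console's code page
    # is never involved in writing these characters.
    with io.open(path, "a", encoding="utf-8") as log:
        log.write(u"%s %s\n" % (label, text))
```

Calling `log_to_file("debug.log", u"OLD_DATA:", data)` in place of the `print` statement should then leave the Cyrillic readable when `debug.log` is opened in an editor set to UTF-8.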

If you really want to be able to see it on your console, it might be worth writing a function that screens the strings you're printing for characters the console can't encode.
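
One possible shape for such a screening function is sketched below. The name `console_safe` is invented for this example, and the approach (escaping rather than dropping the characters) is just one option. It relies on the standard `backslashreplace` error handler, which swaps any character the target encoding cannot represent for a `\uXXXX` escape instead of raising.

```python
# -*- coding: utf-8 -*-
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the console cannot encode escaped.

    Hypothetical helper; `text` is assumed to be a unicode string.
    """
    # Fall back to ASCII if the stream does not report an encoding
    # (for example when output is piped on Python 2).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Unencodable characters become \uXXXX escapes instead of raising
    # UnicodeEncodeError; decoding again yields a printable unicode string.
    return text.encode(encoding, "backslashreplace").decode(encoding)
```

With this, `print(console_safe(data))` should no longer raise on a console with a limited code page; the Cyrillic shows up as escapes there, and unchanged on a console that can encode it.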

I also found a couple of other Stack Overflow posts that might be helpful in your efforts:

How do I get Cyrillic in the output, Python?

What is right way to use cyrillic in python lxml library

I would also recommend this Python manual entry and this article:

https://docs.python.org/2/howto/unicode.html

http://www.joelonsoftware.com/articles/Unicode.html

Ketzak
  • 1
    My pleasure! Unicode errors are the bane of my existence, as I do a lot of batch processing. I usually wrap statements that would cause them in try/catch code that logs the error, so that my program doesn't dump in the middle of a 30-minute run simply because it couldn't log something :P – Ketzak Jan 06 '16 at 20:19
  • 1
    And if you really need to get Unicode out to Windows's sad broken command prompt: https://pypi.python.org/pypi/win_unicode_console – bobince Jan 06 '16 at 23:21
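
The try/catch wrapping mentioned in the first comment might look roughly like the sketch below. The wrapper name `run_logged` and the logger name `batch` are assumptions made for the example, not anything from the comment itself. The idea is simply that a `UnicodeError` raised by one step is recorded and the batch carries on.

```python
import logging

logger = logging.getLogger("batch")  # hypothetical logger name


def run_logged(step, *args):
    """Run step(*args); log Unicode errors instead of aborting the run.

    Returns (True, result) on success and (False, None) if a
    UnicodeError (encode or decode) was raised. Hypothetical helper.
    """
    try:
        return True, step(*args)
    except UnicodeError as exc:
        # Only the error description is logged, so the log call itself
        # does not have to cope with the offending characters.
        name = getattr(step, "__name__", repr(step))
        logger.error("Unicode problem in %s: %s", name, exc)
        return False, None
```

Catching only `UnicodeError` is a deliberate choice here: other exceptions still surface, so genuine bugs are not silently swallowed during a long run.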