I am trying to fetch a page from Wikipedia's API and write that page to a file, using Python.

import json

# issue_request() and params_html are defined elsewhere in my script
json_data = json.loads(issue_request(params_html))
document = json_data['parse']['text']['*'].encode('utf-8')
a = open('test.html', 'wb')
a.write(document)
a.close()

The request I am issuing is to http://pt.wikipedia.org/w/api.php?action=parse&prop=text&page=Dia_dos_Namorados&format=json

The problem is that when I open 'test.html' in a browser, all the accented characters are rendered incorrectly, so I see things like: Dia de SÃ£o Valentim.

I have tried all sorts of different encoding schemes, including encoding to 'latin-1' or using codecs, but none has worked so far. Interestingly, if I open the file in a text editor (Sublime Text), the accented characters display fine. It's only in the browser that they look wrong.

rmacqueen
  • Your text editor is probably defaulting to UTF-8. (If you're on Mac, or most recent Linux distros, that's your system default, so _most_ apps will probably use it.) That doesn't tell you anything about what programs that don't default to UTF-8 will show. – abarnert Jul 13 '13 at 00:54
  • Changing the encoding to 'utf-16' seemed to fix it. Thanks everyone, I have a better understanding of the underlying problem now. – rmacqueen Jul 13 '13 at 01:00
  • You haven't really fixed anything, you've just picked an encoding that your browser is able to guess. (Which probably means your browser is Internet Explorer, right?) – abarnert Jul 13 '13 at 01:02

1 Answer

You're saving an HTML fragment as UTF-8.

Normally you specify the character set for an HTML document by having a Content-Type, either in the HTTP headers, or in the HTML head. But you don't have HTTP (it's just a file), or an HTML head section (it's just a fragment), so there's no way to do either. So, your browser has to guess.
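To make that concrete, here is a rough sketch of a check for a charset declaration inside the HTML itself. The helper name and the regular expression are my own simplification for illustration, not what any browser actually runs:

```python
import re

# Simplified pattern (an assumption, not a full HTML parser): finds
# charset=... inside a <meta> tag, in either the http-equiv or the short form.
_META_CHARSET = re.compile(
    br'''<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)''',
    re.IGNORECASE,
)

def declared_charset(html_bytes):
    """Return the charset named in a <meta> tag, lowercased, or None."""
    match = _META_CHARSET.search(html_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

# A bare fragment like the one the API returns declares nothing.
fragment = b"<div>Dia de S\xc3\xa3o Valentim</div>"
with_meta = (b'<meta http-equiv="Content-Type" '
             b'content="text/html;charset=UTF-8">' + fragment)
```

For the bare fragment this returns None, which is the situation where the browser is left to guess.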

Most browsers in this case will default to Latin-1, although some will use your system character set instead, or offer a way to configure it, or will even try to magically guess. At any rate, if your browser tries to show UTF-8 as Latin-1, you'll end up with stuff like this:

Esta pÃ¡gina ou secÃ§Ã£o foi marcada para revisÃ£o, …
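That garbling is easy to reproduce in Python. A minimal sketch, assuming the bytes on disk really are UTF-8 and the browser guesses Latin-1 (the function name is made up for illustration):

```python
# -*- coding: utf-8 -*-

def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")

original = u"Dia de S\u00e3o Valentim"   # U+00E3 is the a-with-tilde
garbled = misread_as_latin1(original)    # it becomes two characters: U+00C3 U+00A3

# Reversing the wrong guess recovers the text, so the file itself is intact.
recovered = garbled.encode("latin-1").decode("utf-8")
```

Each accented character takes two bytes in UTF-8, and Latin-1 shows each of those bytes as its own character, which is why every accent turns into a pair of symbols.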

Your browser probably has a way to override the default character set for a page. For example, with Safari, go to the View menu, then Text Encoding, then pick UTF-8, and:

Esta página ou secção foi marcada para revisão, …


So, how do you fix it permanently?

Well, you can't really fix it permanently, because there is no right way to store non-ASCII data in an HTML fragment. In fact, technically speaking, browsers shouldn't even be displaying HTML fragments like this as documents.

However, many browsers will let you toss in <meta> tags at the very top of a fragment. So, it could be just a matter of this:

# the file was opened in binary mode, so write the tag as bytes too
a.write(b'<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">')
a.write(document)

But that won't work with every browser, and isn't supposed to work; it just happens to with many of them.

What should be legal is to wrap the fragment in a document. This can be as simple as something like:

# document is already UTF-8 bytes, so the wrapper is built from bytes as well
a.write(b'''<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head>
<body>''' + document + b'''</body>
</html>''')

It's probably a better idea to figure out exactly which HTML version the page is written in, and use the appropriate doctype. But this should be good enough.
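If you would rather keep everything as text and encode only once, at write time, something along these lines should work. This is a sketch under my own assumptions: the template and the two function names are hypothetical, and the fragment is passed in as text, before any .encode('utf-8') call.

```python
import io

# Hypothetical template; the <meta> tag is the same one used above.
TEMPLATE = u"""<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></head>
<body>{body}</body>
</html>
"""

def wrap_fragment(fragment):
    """Return a complete HTML document (text, not bytes) around a fragment."""
    return TEMPLATE.format(body=fragment)

def save_document(fragment, path):
    """Write the wrapped fragment to path, encoded as UTF-8."""
    # io.open encodes on write, so the bytes on disk match the declared charset.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(wrap_fragment(fragment))
```

With the question's code, that would mean calling save_document(json_data['parse']['text']['*'], 'test.html') instead of encoding and writing by hand.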

abarnert