
I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using

r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))

At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.

Any workarounds or alternatives? Thanks.

user3084415
  • you need to have your environment set to display unicode characters, which I don't think git bash does by default. – MattDMo Jan 09 '14 at 21:21
  • use `r_site.json()` instead of `json.loads(r_site.text)`. Drop `.encode('utf-8')` – jfs Jan 09 '14 at 21:22
  • Is there any difference between the two? -- I just tried it and I will occasionally get a charmap codec can't encode character error when an accented character shows up – user3084415 Jan 09 '14 at 21:24

2 Answers

import sys

Then change your encode call to .encode(sys.stdout.encoding, errors="some string")

where "some string" can be one of the following:

  • 'strict' (the default) - raises a UnicodeEncodeError when an unencodable character is encountered
  • 'ignore' - drops the unencodable characters
  • 'replace' - replaces each unencodable character with a ?
  • 'xmlcharrefreplace' - replaces unencodable characters with XML character references
  • 'backslashreplace' - replaces unencodable characters with backslash-escaped code point values

So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
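As a rough illustration of the handlers listed above, here is a minimal sketch. It encodes to "ascii" rather than sys.stdout.encoding so the result is the same on every machine, and the sample string and the names text, HANDLERS and results are made up for the example.

```python
# Sample text: 'abc' followed by U+00E9 (e with acute accent).
text = "abc\u00e9"

# "ascii" is an assumption for the demo; in the answer it would be
# sys.stdout.encoding.
HANDLERS = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", errors=name) for name in HANDLERS}

for name, encoded in results.items():
    print(name, encoded)
# ignore b'abc'
# replace b'abc?'
# xmlcharrefreplace b'abc&#233;'
# backslashreplace b'abc\\xe9'

# 'strict' (the default) raises instead of substituting anything.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict raised:", strict_failed)
```

Note that in Python 3 the result of encode is a bytes object, so to print the substituted text rather than its b'...' form you would decode it again with the same encoding.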

Check here for more reference.

Brian Schlenker
  • Do you have any terminals you suggest using that have the most support for these characters? – user3084415 Jan 09 '14 at 22:18
  • Check @abarnert's answer. If they are correct in assuming you are using Python 3.x, then the terminal is a non-issue (unless you are using the Windows cmd prompt, which doesn't use UTF-8) – Brian Schlenker Jan 09 '14 at 22:24
  • I'm using git bash and also have tried msysgit - both terminals still have issues. – user3084415 Jan 09 '14 at 23:44

I assume this is Python 3.x, given that you're writing 3.x-style print function calls.

In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.

So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):

>>> print('abcé')
abcé
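To check the "Python has correctly guessed sys.stdout.encoding" part, something like the following sketch may help. The helper name can_display is hypothetical; it only tests whether a string can be encoded with a given codec, which is the same step that fails with the "charmap codec can't encode character" error mentioned in the comments.

```python
import sys

def can_display(text, encoding=None):
    """Return True if text can be encoded with the given codec.

    Falls back to sys.stdout.encoding, or "ascii" if that is unset.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print("stdout encoding:", sys.stdout.encoding)
print(can_display("abc\u00e9", "utf-8"))  # True
print(can_display("abc\u00e9", "ascii"))  # False
```

If can_display returns False for your titles with the default encoding, the limitation is in the terminal's encoding rather than in the string itself.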

But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:

>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'

Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and a backslash escape for every byte that isn't printable ASCII.
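A small sketch of the difference, using made-up variable names: str() on a bytes object gives the escaped form that print shows, while decode() turns the bytes back into text.

```python
data = "abc\u00e9".encode("utf-8")  # bytes: 61 62 63 c3 a9

shown = str(data)            # the text that print(data) would display
text = data.decode("utf-8")  # back to a real str

print(shown)        # b'abc\xc3\xa9'
print(len(data))    # 5 (bytes)
print(len(text))    # 4 (characters)
```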

The solution is just to not call encode('utf-8').
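Applied to the code in the question, that might look like the sketch below. To keep it runnable without network access it parses a hard-coded payload; SAMPLE and first_title are hypothetical names, and the payload is only assumed to have the same shape as the API response used in the question.

```python
import json

# Hypothetical payload shaped like the response in the question.
# The title contains U+2013 (en dash), as in "25\xe2\x80\x9399".
SAMPLE = '{"query": {"random": [{"id": 1, "ns": 0, "title": "25\\u201399"}]}}'

def first_title(payload):
    """Return the first random page title as a str, with no .encode() call."""
    return json.loads(payload)["query"]["random"][0]["title"]

title = first_title(SAMPLE)
# ascii() is used here only so the demo output is the same on any terminal.
print(type(title).__name__, ascii(title))  # str '25\u201399'
```

With a live response, the equivalent would be r_site.json()["query"]["random"][0]["title"], as suggested in the comments on the question, followed by a plain print(title).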

Most likely the confusion comes from reading code written for Python 2.x, where bytes and str are the same type (and the type that print actually wants), and then using it in Python 3.x.

abarnert