Python - How to get accented characters correct? (BeautifulSoup)

Question

I've written some Python code with BeautifulSoup to scrape HTML, but I can't work out how to get accented characters to come out correctly.

The charset of the HTML is this:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

I have this Python code:

some_text = soup_ad.find("span", { "class" : "h1_span" }).contents[0]
some_text.decode('iso-8859-1','ignore')

And I get this:

CalÃ§Ãµes

What am I doing wrong here? Any clues?

Best Regards,

[Beautiful Soup uses Unicode internally](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). From Unicode you would *encode* NOT decode. — mechanical_meat, Feb 01 '13 at 18:53
Is this Python 2 or 3? And BS 3 or 4? It's always worth mentioning in Python questions, but when you're dealing with charset/encoding questions, it's absolutely critical. — abarnert, Feb 01 '13 at 21:32
@bernie: +1. Except that you should not be `encode`-ing if your goal is to put the data into a `sqlite3` database… — abarnert, Feb 01 '13 at 21:54

score 0 · Answer 1 · answered Feb 01 '13 at 18:47


The first question is where you are seeing this output. If it is what your terminal prints, the terminal may simply expect a different encoding.

You can try this when using print:

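import sys
# Python 2: fall back to the filesystem encoding when stdout reports none (e.g. when piped)
outenc = sys.stdout.encoding or sys.getfilesystemencoding()
# t is a byte string (Py2 str) holding ISO-8859-1 data
print t.decode("iso-8859-1").encode(outenc)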


enpenax


I'm outputting to an SQLite3 database, not to the screen. Sorry for not explaining that in the question. – André Feb 01 '13 at 18:51
What encoding does SQLite3 expect for incoming data? Depending on that, try encoding your string to match :) – enpenax Feb 01 '13 at 18:58

`sqlite3` databases are UTF-8, unless you explicitly change it at runtime (`PRAGMA encoding`) or change the default at compile time. IIRC, Py3 requires that you use Unicode (`str`) for all `sqlite3` methods, and breaks if the database isn't UTF-8, while Py2 allows you to use either 8-bit (`str`) or Unicode (`unicode`), but has all kinds of problems if the database and your 8-bit strings aren't UTF-8. – abarnert Feb 01 '13 at 21:31
Further, it's not necessary to encode for your terminal - stdout will do it for you. If Python is choosing the wrong encoding for your terminal, then you should change your locale, not change your code – Alastair McCormack Jul 02 '16 at 07:15

abarnert · Answer 2 · 2013-02-01T21:53:13.147

As bernie points out, BS uses Unicode internally.

For BS3:

Beautiful Soup Gives You Unicode, Dammit

By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.

For BS4, the docs explain a bit more clearly when this happens:

You can pass in a string or an open filehandle… First, the document is converted to Unicode, and HTML entities are converted to Unicode characters…

In other words, it decodes the data immediately. So, if you're getting mojibake, you have to fix it before it gets into BS, not after.

The BeautifulSoup constructor accepts 8-bit byte strings or files and tries to figure out the encoding. See Encodings for details. You can check whether it guessed right by printing `soup.original_encoding`. If it didn't guess ISO-8859-1 or a synonym, your only option is to make the encoding explicit: decode the string before passing it in, open the file in Unicode mode with an encoding, etc.
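
As a rough sketch of the "decode the string before passing it in" option, assuming Python 3 and a made-up `raw_html` byte string standing in for whatever you downloaded (the real page bytes will differ):

```python
# Hypothetical raw bytes, as a server would send them with charset=ISO-8859-1.
# In ISO-8859-1, 0xE7 is "ç" and 0xF5 is "õ".
raw_html = b'<span class="h1_span">Cal\xe7\xf5es</span>'

# Decode explicitly with the charset the page declares, so nothing has to guess.
html_text = raw_html.decode("iso-8859-1")

# html_text is now a Unicode str; this is what you would hand to the
# BeautifulSoup constructor instead of the raw bytes.
assert "Cal\u00e7\u00f5es" in html_text
```

Once the text is Unicode before it reaches the parser, there is no guessing step left that could go wrong.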

The results that come out of any BS object, and anything you pass as an argument to any method, will always be UTF-8 (if they're byte strings). So, calling decode('iso-8859-1') on something you got out of BS is guaranteed to break stuff if it's not already broken.
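
To see why that breaks things, here is a small sketch (Python 3, standard library only, with the word from the question as sample data) that reproduces the exact garbage shown in the question by decoding UTF-8 bytes as ISO-8859-1:

```python
text = "Cal\u00e7\u00f5es"  # "Calções" as a proper Unicode string

# Byte strings coming out of BS are UTF-8, so "ç" and "õ" are two bytes each.
utf8_bytes = text.encode("utf-8")

# Decoding those bytes as ISO-8859-1 turns every byte into its own character.
garbled = utf8_bytes.decode("iso-8859-1")
assert garbled == "Cal\u00c3\u00a7\u00c3\u00b5es"  # displays as "CalÃ§Ãµes"

# Decoding with the encoding the bytes are really in gives the original back.
assert utf8_bytes.decode("utf-8") == text
```

Note that `'ignore'` does not help here: every UTF-8 byte is a valid ISO-8859-1 character, so nothing is ever rejected or dropped.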

And you don't want to do this anyway. As you said in a comment, "I'm outputting to an SQLite3 database." Well, sqlite3 always uses UTF-8. (You can change this with a pragma at runtime, or change the default at compile time, but that basically breaks the Python interface, so… don't.) And the Python interface only allows UTF-8 in Py2 str (and of course in Py2 unicode/Py3 str, there is no encoding.) So, if you try to encode the BS data into Latin-1 to store in the database, you're creating problems. Just store the Unicode as-is, or encode it to UTF-8 if you must (Py2 only).
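
A minimal sketch of "just store the Unicode as-is", assuming Python 3 (where `str` is already Unicode) and a throwaway in-memory database; the `ads` table and `title` column are made-up names for illustration:

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (title TEXT)")

# Stand-in for the Unicode text you got out of BeautifulSoup.
title = "Cal\u00e7\u00f5es"

# Pass the Unicode string straight through as a parameter; no encode/decode.
conn.execute("INSERT INTO ads (title) VALUES (?)", (title,))

# Reading it back gives the same Unicode string.
stored = conn.execute("SELECT title FROM ads").fetchone()[0]
conn.close()

assert stored == title
```

The `sqlite3` module handles the conversion to and from the database's UTF-8 storage on its own, so the calling code never touches bytes.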

If you don't want to figure all of this out, just use Unicode everywhere after the initial call to BeautifulSoup and you'll never go wrong.
