Python: Output of beautifulsoup has wrong encoding

Question

I ran into an encoding problem when passing a response to BeautifulSoup. The raw response reads correctly, e.g. Artikelstandort: Österreich, but after running it through BeautifulSoup it becomes Artikelstandort: Ã–sterreich. Here is the current code:

def formTest (browser, formUrl, cardName, edition):
   browser.open (formUrl)

   data = browser.response().read()
   with open ('analyze.txt', 'wb') as textFile:
      print 'writing file'
      textFile.write (data)

   #BS4 -> need from_encoding
   soup = BeautifulSoup (data, from_encoding = 'latin-1')
   soup = soup.encode ('latin-1').decode('utf-8')
   table = soup.find('table', { "class" : "MKMTable specimenTable"})

data has the correct data, but the soup has the wrong encoding. I tried various encoding/decoding on the soup, but got no working result.
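The garbling described above has the typical shape of UTF-8 bytes being decoded with a single-byte Western codec. The following is a minimal sketch of that mechanism, not a diagnosis of the actual page: it assumes the text arrives as UTF-8 and is misread as Windows-1252 (`cp1252`), and the sample string is taken from the question.

```python
# Assumption: the server sends UTF-8, and something decodes it as cp1252.
original = "Artikelstandort: Österreich"

# 'Ö' becomes the two bytes 0xC3 0x96 in UTF-8.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec turns each byte into its own character.
garbled = utf8_bytes.decode("cp1252")

# Reversing the wrong step recovers the text, as long as nothing was lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repr(repaired))
```

If this is what happens here, the fix is to decode the bytes once with the right codec rather than to re-encode the parsed result afterwards.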

The page where I pull my data from is: https://www.magickartenmarkt.de/Mutilate_Magic_2013.c1p256992.prod

Edit: I changed the encoding with prettify as suggested, but now I'm facing the following error:

TypeError: slice indices must be integers or None or have an __index__ method

What does prettify change? I printed the new output, and the table is still in the "soup" (<table class="MKMTable specimenTable">).

Edit2:

New error is:

at: soup.encode ('latin-1').decode('utf-8')

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 518: invalid start byte

If I play with the encodings and decodings, I get errors about decoding some other byte instead.
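One way an error of this kind can arise is sketched below. It assumes the text contains a character such as 'ü', which Latin-1 encodes as the single byte 0xfc; that byte can never start a UTF-8 sequence. The sample word is hypothetical and the sketch uses only the standard library.

```python
# Hypothetical sample text containing 'ü'.
latin1_bytes = "für".encode("latin-1")  # 'ü' is the single byte 0xfc in Latin-1

try:
    # Latin-1 bytes are not UTF-8, so this decode fails on the 0xfc byte.
    latin1_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec that produced the bytes works.
round_trip = latin1_bytes.decode("latin-1")

print(error_message)
print(round_trip)
```

In other words, chaining `encode('latin-1')` with `decode('utf-8')` only succeeds when the Latin-1 bytes happen to form valid UTF-8, which is why the failing byte changes as the codecs are swapped around.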

Also, this answer may be of help. http://stackoverflow.com/questions/7219361/python-and-beautifulsoup-encoding-issues — AlexLordThorsen, May 30 '13 at 18:35
Updated the main post. It's the same as what you suggested; I had already tried it, and it resulted in the error described above. — Rappel, May 30 '13 at 18:49
Are you getting the error from the prettify line or the soup.find line? soup is originally a BeautifulSoup object, but prettify returns a Unicode string. — AlexLordThorsen, May 30 '13 at 19:14
The error is from the soup object (soup.find). So I either need to transform the soup object, or I try to transform the object returned from soup.find(). — Rappel, May 30 '13 at 19:19
You can just pass it back to the BeautifulSoup constructor. What I hope will happen is that prettify will correct the encoding for you. — AlexLordThorsen, May 30 '13 at 19:23
In the linked answer "Finally got it, just had to: soup = BeautifulSoup(content, fromEncoding='latin-1') then when it got time to parse the links: i_title = item.contents[0].encode('latin-1').decode('utf-8') that seemed to do the trick. Thanks for your help :)" — AlexLordThorsen, May 30 '13 at 19:24
It's a good point, but it's a bit tricky if you don't know which encoding you'll encounter. I tried his trick, but the encoding/decoding throws errors, which I'll add to the top post. — Rappel, Jun 03 '13 at 20:50
You should know that `soup` is not of type `str` or type `unicode`, but instead of type `bs4.BeautifulSoup`. That object has a method called `decode(self, pretty_print=False, eventual_encoding='utf-8', formatter='minimal')`, as can be seen by inspecting it with `help`. I suspect BS automatically encodes it for you, and that you are then subsequently encoding it once more by calling `encode`. Just replace `soup.encode('latin-1').decode('utf-8')` with `soup.decode(eventual_encoding='utf-8')`. — Akshat Mahajan, Mar 08 '17 at 01:24
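The distinction drawn in the last comment can be checked with a small sketch. It assumes Python 3 with the `bs4` package installed and uses a hypothetical one-line document; the built-in `html.parser` backend is chosen only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, already a decoded str.
html = "<p>Artikelstandort: Österreich</p>"
soup = BeautifulSoup(html, "html.parser")

# soup is a BeautifulSoup object, not a string.
soup_type = type(soup).__name__

# encode() serializes the tree to bytes in the requested encoding.
as_bytes = soup.encode("utf-8")

# decode() serializes the tree to a text string.
as_text = soup.decode()

print(soup_type, type(as_bytes).__name__, type(as_text).__name__)
```

Because `soup.encode(...)` already returns bytes in the chosen encoding, decoding that result with a different codec reinterprets the bytes instead of fixing them.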

score 1 · Answer 1 · answered Mar 08 '17 at 01:11

You probably don't need the solution by now, but if anyone stops by, here is what you should do:
You should probably apply the encoding procedures to data, not to soup.
What I usually do is use the requests library to get the raw response, enforce the encoding with response.encoding = 'utf-8', and then take the text content via response.text.
Finally, I feed response.text to BeautifulSoup().
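A minimal sketch of this approach follows. To stay runnable without network access it replaces the HTTP response with a hypothetical byte string, and it assumes the page is UTF-8 encoded, Python 3, and the `bs4` package with the built-in `html.parser` backend. The table class comes from the question; the cell content is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw bytes a server would return (assumed UTF-8).
data = (
    '<table class="MKMTable specimenTable">'
    "<tr><td>Artikelstandort: Österreich</td></tr>"
    "</table>"
).encode("utf-8")

# Decode the bytes once, with the encoding the page actually uses.
text = data.decode("utf-8")

# Parse the decoded text; no from_encoding is needed because text is already a str.
soup = BeautifulSoup(text, "html.parser")
table = soup.find("table", {"class": "MKMTable specimenTable"})
cell_text = table.find("td").get_text()

print(cell_text)
```

With requests, the decoding step would correspond to setting `response.encoding = 'utf-8'` before reading `response.text`, since `response.text` is decoded using whatever `response.encoding` holds at that moment.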
