I have a Python program that crawls data from a site and returns JSON. The crawled site declares charset=ISO-8859-1 in its meta tag. Here is the source code:

import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text

After that I extract the information with Beautiful Soup and build the JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).

EDIT

The new code after @ChrisKoston's suggestion, using .content instead of .text:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work, but the character problem remains.

EDIT2

The solution is to use .content.decode('cp1252'):

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution.

Jobeso

1 Answer

You must actually store the result of decode() somewhere because it does not modify the original variable.
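A minimal sketch of that point, assuming Python 3 and a made-up byte string standing in for the downloaded page (the names `raw` and `text` are illustrative only):

```python
raw = b"5 \x80"               # hypothetical cp1252 bytes: "5 " plus the euro sign

raw.decode("cp1252")          # the returned str is thrown away here
unchanged = raw               # raw is still the same bytes object

text = raw.decode("cp1252")   # keep the result by assigning it to a name
```

After this runs, `raw` is still bytes and only `text` holds the decoded string.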

Another thing:

  • decode() turns a sequence of bytes into a string.
  • encode() does the opposite: it turns a string into a sequence of bytes.
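The two directions can be sketched as follows, assuming Python 3; the sample string is made up for illustration:

```python
text = "5 \u20ac"                    # str: the digit 5, a space, the euro sign

as_utf8 = text.encode("utf-8")       # b'5 \xe2\x82\xac' (euro takes three bytes)
as_cp1252 = text.encode("cp1252")    # b'5 \x80'         (euro takes one byte)

# decode() reverses encode() as long as the same codec is used for both
round_trip = as_cp1252.decode("cp1252")
```

The same string yields different bytes depending on the codec, which is why the codec used for decoding has to match the one the bytes were written with.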

BeautifulSoup is happy with strings; you don't need to use encode() at all.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')
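As for why cp1252 helps even though the page declares ISO-8859-1: the two codecs agree on most bytes but differ in the range 0x80 to 0x9F, so presumably the page really contains Windows-1252 bytes. A small sketch of the difference, assuming Python 3 and a single hand-written byte rather than real page data:

```python
raw = b"\x80"                           # the byte the euro sign occupies in cp1252

as_latin1 = raw.decode("iso-8859-1")    # U+0080, an invisible control character
as_cp1252 = raw.decode("cp1252")        # U+20AC, the euro sign

code_points = (hex(ord(as_latin1)), hex(ord(as_cp1252)))
```

Here `code_points` comes out as `('0x80', '0x20ac')`, which matches the `\x80` seen in the question when the text was decoded as ISO-8859-1.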

Hint: For working with HTML you might want to look at pyquery instead of BeautifulSoup.

Tomalak
  • Thank you for your quick help. I edited the source code, but the `€` character is still \x80 when I run the program – Jobeso Nov 17 '16 at 17:18
  • `\x80` is the character code for the Euro symbol. Don't look at the IDLE console, it displays characters this way when it wants to. Write the string to a file and look again. – Tomalak Nov 17 '16 at 17:23
  • This worked for the title now! Thanks a lot for that. The description is still not working. I'll post the code in the question – Jobeso Nov 17 '16 at 23:59
  • Now everything is working. I had to replace the `.text` with `.content` there, too. Thank you so much for your help! – Jobeso Nov 18 '16 at 00:08
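Following the suggestion in the comments to write the string to a file instead of judging it by the console, here is a rough sketch assuming Python 3. The `title` value, the `result.json` file name and the temporary directory are placeholders, not part of the original program:

```python
import json
import os
import tempfile

title = b"5 \x80".decode("cp1252")    # stand-in for text pulled out of the soup

# ensure_ascii=False keeps the euro sign as a real character in the JSON text
payload = json.dumps({"title": title}, ensure_ascii=False)

path = os.path.join(tempfile.mkdtemp(), "result.json")   # hypothetical output path
with open(path, "w", encoding="utf-8") as fh:
    fh.write(payload)

# Read the raw bytes back to see what a consumer such as PHP would receive
with open(path, "rb") as fh:
    on_disk = fh.read()
```

The file then contains the euro sign as the UTF-8 bytes `\xe2\x82\xac`. With the default `ensure_ascii=True`, `json.dumps` would write the escape `\u20ac` instead, which JSON parsers turn back into the same character.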