
Here is the full script:

import requests
import bs4


res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()

multiline_code = """{}""".format(page_HTML_code)

f = open("testfile.txt","w+")
f.write(multiline_code)
f.close()

I'm trying to write the entire downloaded HTML to a file while keeping it neat and clean.

I understand that the problem is with the text, since certain characters can't be saved, but I'm not sure how to encode the text correctly.

Can anyone help?

This is the error message I get:

"C:\Location", line 16, in <module>
    f.write(multiline_code)
  File "C:\\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0421' in position 209: character maps to <undefined>
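
The failure can be reproduced without the network request. This is a minimal sketch, assuming the file was opened with the cp1252 codec that appears in the traceback; the variable names are made up for illustration:

```python
# U+0421 is CYRILLIC CAPITAL LETTER ES, the character named in the traceback.
text = "\u0421"

# cp1252 (the codec shown in the traceback) has no mapping for this character,
# so encoding raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent the character, here as two bytes.
utf8_bytes = text.encode("utf-8")

print(cp1252_ok)   # False
print(utf8_bytes)  # b'\xd0\xa1'
```

So the text itself is fine; the codec used when writing the file is what cannot represent it.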
TacoCat
  • Try `open('testfile.txt', 'wb')` for writing as a binary file. To read the file you will then need to open it with `open('testfile.txt', 'rb')`. – Engineero May 09 '18 at 16:56
  • Also, use `with open('testfile.txt', 'wb') as a_file:` followed by an indented `a_file.write(...)` instead of using explicit `open` and `close` statements. Context managers (the `with ... as ...:` syntax) are less likely to go wrong. – Engineero May 09 '18 at 16:57
  • @Engineero if I try your suggestions I get a TypeError that says "a bytes-like object is required, not 'str'" – TacoCat May 09 '18 at 16:59
  • My mistake, try adding `f.write(multiline_code.encode('ascii'))` when writing as binary. Basically you have to specify an encoding in Python 3, turns out: https://stackoverflow.com/a/29151455/3670871 – Engineero May 09 '18 at 17:02
  • @Engineero looks like it has a problem with one specific character, because I'm getting the same error, only this time it says that ascii can't encode the character \u0421. Is there any workaround for this? – TacoCat May 09 '18 at 17:06
  • You could try encoding with [`.encode('utf-8')`](https://docs.python.org/3/library/stdtypes.html#str.encode), although I think you might have the same problem. You can also choose to ignore errors with [`.encode('utf-8', errors='ignore')` or one of several other options listed here](https://docs.python.org/3/library/stdtypes.html#str.encode). – Engineero May 09 '18 at 17:16
  • For instance, I think `.encode('utf-8', errors='backslashreplace')` may replace the unknown character with the literal string `'\u0421'`, so you wouldn't lose that information, but you may have to do something funky to decode it when you read it back. – Engineero May 09 '18 at 17:18
  • @Engineero thanks for your help. :) Just posted an answer to my own question that did the trick. – TacoCat May 09 '18 at 17:18
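
The error handlers mentioned in the comments above can be compared side by side. This is a minimal sketch, assuming a short sample string (`text` and the other names are made up for illustration). Note that the handlers only come into play when the codec cannot represent a character, as with ASCII; UTF-8 encodes U+0421 directly, so no handler is triggered there:

```python
text = "abc \u0421"

# 'ignore' silently drops characters the codec cannot represent.
ignored = text.encode("ascii", errors="ignore")

# 'backslashreplace' keeps the information as a literal escape sequence.
escaped = text.encode("ascii", errors="backslashreplace")

# The escape can be turned back into the character with the unicode_escape
# codec. This is only safe here because the sample contains no other
# backslashes and no non-ASCII bytes.
restored = escaped.decode("unicode_escape")

# With UTF-8 the character is encodable, so the handler is never used.
utf8 = text.encode("utf-8", errors="backslashreplace")

print(ignored)   # b'abc '
print(escaped)   # b'abc \\u0421'
print(restored == text)  # True
print(utf8)      # b'abc \xd0\xa1'
```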

1 Answer

I did some digging around and this worked:

import requests
import bs4


res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()

multiline_code = """{}""".format(page_HTML_code)

# Passing encoding='utf-8' when opening the file is what fixed the error
with open('testfile.html', 'w+', encoding='utf-8') as fb:
    fb.write(multiline_code)
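
To check that nothing is lost, the file can be read back with the same encoding. This is a minimal sketch that uses a temporary directory and a made-up sample string in place of the downloaded page, so it runs without a network connection:

```python
import os
import tempfile

# Sample HTML containing Cyrillic text, standing in for the downloaded page.
html = "<p>\u0421\u043f\u0430\u0441\u0438\u0431\u043e</p>\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "testfile.html")

    # Write with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(html)

    # Read back with the same encoding; a mismatch here would garble the text.
    with open(path, "r", encoding="utf-8") as in_file:
        round_tripped = in_file.read()

print(round_tripped == html)  # True
```

Specifying the encoding on both the write and the read keeps the result independent of the operating system's default.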
TacoCat