Save Contents of URL to Text File

Question

I'm trying to save the contents of a URL to a text file. I found several sample scripts online, and the two below seem like good candidates for what I want to do, but both return this error:

TypeError: a bytes-like object is required, not 'str'

import html2text
import urllib.request

with urllib.request.urlopen("http://www.msnbc.com") as r:
    html_content = r.read()
rendered_content = html2text.html2text(html_content)
file = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
file.write(rendered_content)
file.close()



import sys
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    # Not Python 3 - today, it is most likely to be Python 2
    # But note that this might need an update when Python 4
    # might be around one day
    from urllib import urlopen
# Your code where you can use urlopen
with urlopen("http://www.msnbc.com") as r:
    s = r.read()
rendered_content = html2text.html2text(s)
file = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
file.write(rendered_content)
file.close()

I'm probably missing something simple here, but I can't tell what it is.

I am using Python 3.6.

Possible duplicate of [TypeError: a bytes-like object is required, not 'str' in python and CSV](https://stackoverflow.com/questions/34283178/typeerror-a-bytes-like-object-is-required-not-str-in-python-and-csv) — Nathan Vērzemnieks, Jan 30 '18 at 02:37
Please include at least the last few lines of the traceback - exactly where the error is occurring is critical information here. — Nathan Vērzemnieks, Jan 30 '18 at 02:57

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

You need to call decode('utf-8') on the bytes returned by read():

with urlopen("http://www.msnbc.com") as r:
    s = r.read().decode('utf-8')

Without the decode, the variable s holds a bytes object, which has to be decoded before it can be used as text. The error comes from the distinction between Unicode strings and bytes:

Python 3's standard string type is Unicode based, and Python 3 adds a dedicated bytes type, but critically, no automatic coercion between bytes and unicode strings is provided. The closest the language gets to implicit coercion are a few text-based APIs that assume a default encoding (usually UTF-8) if no encoding is explicitly stated. Thus, the core interpreter, its I/O libraries, module names, etc. are clear in their distinction between unicode strings and bytes. Python 3's unicode support even extends to the filesystem, so that non-ASCII file names are natively supported.

This string/bytes clarity is often a source of difficulty in transitioning existing code to Python 3, because many third party libraries and applications are themselves ambiguous in this distinction. Once migrated though, most UnicodeErrors can be eliminated.

Source: https://www.python.org/dev/peps/pep-0404/#strings-and-bytes
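Putting the pieces together, here is a minimal sketch of the whole fetch, decode, and save flow. The helper names decode_body and save_url_as_text are made up for illustration, and falling back to UTF-8 when the server declares no charset is an assumption, not something the question specifies. It uses only the standard library; the html2text call is left as a comment because it is a third-party package.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Turn the bytes returned by read() into a str."""
    # Assumption: fall back to UTF-8 when no charset is declared.
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


def save_url_as_text(url, path):
    """Fetch url and write its decoded body to path (hypothetical helper)."""
    with urllib.request.urlopen(url) as r:
        raw = r.read()                             # bytes, not str
        charset = r.headers.get_content_charset()  # None if not declared
    text = decode_body(raw, charset)               # now a str
    # html2text.html2text(text) could be applied here; it expects a str.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Calling save_url_as_text("http://www.msnbc.com", 'C:\\Users\\Excel\\Desktop\\URL.txt') would then mirror the code in the question. Passing encoding="utf-8" to open() avoids depending on the platform's default encoding when the file is written.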

Apologies for my confusion earlier! I shouldn't try to comment on answers when I don't have a computer to test things out myself. +1 — Nathan Vērzemnieks, Jan 30 '18 at 04:48

score 1 · Vox · Accepted Answer · 2018-01-30T03:00:50.113

Try converting the bytes to a string with an explicit encoding:

str(content, encoding="utf-8")

In your code, that would be:

rendered_content = html2text.html2text(str(html_content, encoding="utf-8"))
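As a quick check, str(..., encoding=...) and .decode(...) give the same result, so either answer works. The sketch below uses made-up sample bytes in place of a real response. It also shows two pitfalls: calling str() without an encoding returns the repr of the bytes rather than the text, and passing a str to a bytes method raises a TypeError, which is presumably the kind of mix-up that happens inside html2text when it receives bytes.

```python
raw = b"<p>caf\xc3\xa9</p>"  # made-up sample, standing in for r.read()

# Both spellings decode the bytes into the same str.
via_str = str(raw, encoding="utf-8")
via_decode = raw.decode("utf-8")
same = via_str == via_decode

# Without an encoding, str() returns the repr of the bytes object,
# i.e. text that starts with b' rather than the decoded content.
repr_text = str(raw)

# Calling a bytes method with a str argument raises TypeError.
try:
    raw.replace("<p>", "")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```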

edited Jan 30 '18 at 03:00

answered Jan 30 '18 at 02:52

