Code keeps producing an empty string

Question

I do not understand why the following code keeps producing an empty string. I am trying to write the contents of a web page to a "txt" file, but the response is always empty. Is there an error in the code?

import urllib3
import certifi


# Function: Convert information within html document to a text file
# Append information to the file
def html_to_text(source_html, target_file):

    http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',      # Force certificate check.
        ca_certs=certifi.where(),       # Path to the Certifi Bundle
        headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'},
    )

    r = http.urlopen('GET', source_html)
    print(source_html)
    response = r.read().decode('utf-8')
    # TODO: Find the problem that keeps making the code produce an empty string
    print(response)
    temp_file = open(target_file, 'w+')
    temp_file.write(response)


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0"
target_location = "C:\\Users\\Admin\\PycharmProjects\\TheLastPuff\\Source\\yahoo_ticker_symbols.txt"

html_to_text(source_address, target_location)

When you say "produce", do you mean "printed", "written to a file", or both? Do `print(source_html)` and `print(response)` print anything or not? — Kevin, Jan 05 '16 at 13:17
Neither the print call nor the write call produces anything. `print(source_html)` does print the `source_address` successfully. — Cloud, Jan 05 '16 at 13:27
The `r` object seems to have a `r.data` attribute that holds the response body. http://urllib3.readthedocs.org/en/latest/#usage — Jasper, Jan 05 '16 at 13:32
@Cloud I just tested it on my computer, it works just fine, it prints and writes in the file the website source code. — Sidahmed, Jan 05 '16 at 13:41
I think the problem is that the website detects that a program is trying to scrape data from it and therefore blocks the request. Is there a way around this? — Cloud, Jan 05 '16 at 13:45
Shouldn't you respect the website owner's desire to not be scraped? — Kevin, Jan 05 '16 at 13:47
Sorry for the stupid question, but doesn't it print anything on the terminal? — Sidahmed, Jan 05 '16 at 13:48

score 0 · Answer 1 · answered Jan 05 '16 at 13:52

I get a response with the following code. The only relevant change is to use r.data instead of r.read().

import urllib3
import certifi


def html_to_text(source_html):

    http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',      # Force certificate check.
        ca_certs=certifi.where(),       # Path to the Certifi Bundle
        headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'},
    )

    r = http.urlopen('GET', source_html)
    print(source_html)
    print(r.headers)
    response = r.data                   # instead of read().decode('utf-8')
    print(response)


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0"

html_to_text(source_address)

Used versions:

>>> certifi.__version__
'2015.11.20.1'
>>> urllib3.__version__
'1.14'
>>> sys.version
'3.5.1 (default, Dec  7 2015, 12:58:09) \n[GCC 5.2.0]'
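
As far as I understand, urllib3 reads the whole body when the response is created (its default `preload_content=True`), so a later `r.read()` finds nothing left and returns an empty result, while `r.data` holds the cached bytes. The code above only prints the response, so here is a minimal sketch of how the file-writing part of the question could be combined with `r.data`. Passing the pool manager in as the `http` argument and returning the text are choices made for this sketch, not part of the original code.

```python
def html_to_text(http, source_html, target_file):
    """Fetch source_html through the given pool manager and save it as text.

    `http` is assumed to behave like the urllib3.PoolManager built in the
    code above; passing it in as an argument is a choice made for this sketch.
    """
    r = http.urlopen('GET', source_html)
    # With urllib3's default settings the body was already read when the
    # response was created, so it is taken from r.data rather than r.read().
    text = r.data.decode('utf-8')
    # The with-block closes the file even if write() raises; the explicit
    # encoding avoids depending on the platform default (e.g. on Windows).
    with open(target_file, 'w', encoding='utf-8') as out_file:
        out_file.write(text)
    return text
```

It would be called as `html_to_text(http, source_address, target_location)`, with `http` created by the same `urllib3.PoolManager(...)` call as above. This only addresses the empty string; it does not change how the server treats the request.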

This code seems to work, but I get another error: "urllib.error.HTTPError: HTTP Error 502: Server Hangup". I think this is the website kicking me out. — Cloud, Jan 05 '16 at 14:08
