I am trying to save the contents of a web page to a "txt" file, but the following code keeps producing an empty string and I do not understand why. Is there an error in the code?

import urllib3
import certifi


# Function: Convert information within html document to a text file
# Append information to the file
def html_to_text(source_html, target_file):

    http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',      # Force certificate check.
        ca_certs=certifi.where(),       # Path to the Certifi Bundle
        headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'},
    )

    r = http.urlopen('GET', source_html)
    print(source_html)
    response = r.read().decode('utf-8')
    # TODO: Find the problem that keeps making the code produce an empty string
    print(response)
    temp_file = open(target_file, 'w+')
    temp_file.write(response)


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0"
target_location = "C:\\Users\\Admin\\PycharmProjects\\TheLastPuff\\Source\\yahoo_ticker_symbols.txt"

html_to_text(source_address, target_location)
Kevin
Cloud
  • When you say "produce", do you mean "printed", "written to a file", or "both printed and written to a file"? Do `print(source_html)` and `print(response)` print anything or not? – Kevin Jan 05 '16 at 13:17
  • Neither the print nor the write call produces anything. `print(source_html)` does print the `source_address` successfully. – Cloud Jan 05 '16 at 13:27
  • The `r` object seems to have a `r.data` attribute that holds the response body. http://urllib3.readthedocs.org/en/latest/#usage – Jasper Jan 05 '16 at 13:32
  • @Cloud I just tested it on my computer and it works fine: it prints the website's source code and writes it to the file. – Sidahmed Jan 05 '16 at 13:41
  • I think the problem is that the website detects that a program is trying to scrape data from it and therefore blocks the request. Is there a way around this? – Cloud Jan 05 '16 at 13:45
  • Shouldn't you respect the website owner's desire to not be scraped? – Kevin Jan 05 '16 at 13:47
  • Sorry for this stupid question, but does it really not print anything in the terminal? – Sidahmed Jan 05 '16 at 13:48
  • It just prints whitespace. – Cloud Jan 05 '16 at 14:09

1 Answer

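I get a response with the following code. The only relevant change is to use `r.data` instead of `r.read()`: with urllib3's default `preload_content=True`, the body has already been read into `r.data` by the time the response is returned, so a later `r.read()` has nothing left to return.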

import urllib3
import certifi


def html_to_text(source_html):

    http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',      # Force certificate check.
        ca_certs=certifi.where(),       # Path to the Certifi Bundle
        headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'},
    )

    r = http.urlopen('GET', source_html)
    print(source_html)
    print(r.headers)
    response = r.data                   # instead of read().decode('utf-8')
    print(response)


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0"

html_to_text(source_address)
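
A likely reason `r.read()` came back empty in the question is that the body had already been consumed into `r.data` before `read()` was called. The sketch below is only an analogy using the standard library's `io.BytesIO` as a stand-in for the response body; it does not call urllib3, and the sample bytes are made up for illustration.

```python
import io

# Stand-in for a response body. io.BytesIO is an analogy here, not urllib3 itself,
# and the sample content is invented for illustration.
stream = io.BytesIO(b"<html>hello</html>")

# The first read consumes the whole stream (comparable to the body being
# preloaded into r.data).
preloaded = stream.read()

# A second read finds nothing left (comparable to the later r.read()).
second = stream.read()

print(preloaded)                      # b'<html>hello</html>'
print(second)                         # b''
print(repr(second.decode("utf-8")))   # ''
```

Decoding the empty bytes gives an empty string, which matches the symptom described in the question.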

Used versions:

>>> certifi.__version__
'2015.11.20.1'
>>> urllib3.__version__
'1.14'
>>> sys.version
'3.5.1 (default, Dec  7 2015, 12:58:09) \n[GCC 5.2.0]'
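
The code above prints the body but leaves out the file-writing step from the question. A minimal sketch of that step follows, assuming the body is UTF-8 encoded bytes such as the value of `r.data`. The helper name `save_body`, the sample bytes, and the temporary target path are hypothetical and not part of the original code.

```python
import os
import tempfile


def save_body(body, target_file, encoding="utf-8"):
    """Decode a bytes body and write it to target_file.

    Hypothetical helper: assumes the body is encoded as `encoding`;
    undecodable bytes are replaced rather than raising an error.
    Returns the number of characters written.
    """
    text = body.decode(encoding, errors="replace")
    # The with-block closes the file even if the write fails.
    with open(target_file, "w", encoding=encoding) as out:
        out.write(text)
    return len(text)


# Placeholder data standing in for r.data; no network request is made here.
sample_body = b"<html><body>ticker list</body></html>"
target = os.path.join(tempfile.gettempdir(), "yahoo_ticker_symbols_demo.txt")

written = save_body(sample_body, target)
print(written, "characters written to", target)
```

In the answer's function this would be called as `save_body(r.data, target_file)`. Passing an explicit `encoding` to `open` avoids relying on the platform default, which on Windows is often not UTF-8.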
Jasper
  • This code seems to work, but I get another error: "urllib.error.HTTPError: HTTP Error 502: Server Hangup". I think this is the website kicking me out. – Cloud Jan 05 '16 at 14:08