urllib - Retrieved file failed to be opened

Question

I want to retrieve a PDF file from a given link. The command-line output suggests that the file was saved at the specified location:

import os
from urllib.request import urlretrieve
myPath = 'C:\\Documents'
filename = 'test1.pdf'
url = 'http://www.ha.org.hk/visitor/ha_view_content.asp?content_id=253124&lang=ENG'
fullfilename = os.path.join(myPath, filename)
urlretrieve(url, fullfilename)

>>> ('C:\\Documents\\test1.pdf', <http.client.HTTPMessage object at 0x016E0BB0>)

However, when I go to the directory, test1.pdf appears to be corrupted and cannot be opened.

The downloaded file is only 1 KB in size, whereas the actual file should be around 4 MB.

score 0 · Answer 1 · answered Aug 18 '19 at 14:28

The URL you are trying to download the PDF from, 'http://www.ha.org.hk/visitor/ha_view_content.asp?content_id=253124&lang=ENG', returns the following HTML page rather than the PDF itself:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>HA</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <link href="/visitor/v3/css/style-en.css" rel="stylesheet" type="text/css" />
        <script language="javascript" src="/visitor/v3/script/stylesheet.js" type="text/javascript"></script>
        <script language="javascript" type="text/javascript" src="/visitor/v3/common/js/function.js"></script>
        <script language="JavaScript" src="common_functions.js"></script>
    </head>
    <body id="iframebody">
        <div id="contentarea">
            <script>window.open('/haho/ho/bssd/KCCTE105218BTSa.pdf', '_self');</script>
        </div>
    </body>
</html>

This HTML page is what urlretrieve(url, fullfilename) downloads and saves to the file, which is why the file is only 1 KB in size.
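One quick way to confirm this diagnosis is to look at the first bytes of the saved file: a real PDF normally begins with the marker %PDF-, whereas the page shown here begins with <!DOCTYPE html. The following is a minimal sketch; the helper names looks_like_pdf and file_looks_like_pdf are hypothetical, not part of urllib.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    # Leading whitespace is tolerated; a strict check would drop lstrip().
    return data.lstrip().startswith(b"%PDF-")


def file_looks_like_pdf(path: str) -> bool:
    """Read the start of a saved file and check it for the PDF marker."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(1024))
```

Calling file_looks_like_pdf(fullfilename) right after urlretrieve would be expected to return False for the 1 KB file described in the question, since that file contains HTML.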

You can try the URL 'http://www.ha.org.hk/haho/ho/bssd/KCCTE105218BTSa.pdf' instead, which is the one the page redirects to, or build it from the output of the request above.
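Building the PDF URL from the page output could look roughly like the sketch below. It assumes the page always embeds the target in a window.open('...') call, as in the HTML above; other pages on the site may be structured differently. The function names extract_pdf_url and download_pdf are hypothetical.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve


def extract_pdf_url(page_url, html):
    """Return the absolute URL passed to window.open(...), or None."""
    # Capture the first quoted argument of window.open(...)
    match = re.search(r"""window\.open\(\s*['"]([^'"]+)['"]""", html)
    if match is None:
        return None
    # The captured path is relative to the site root, so resolve it
    return urljoin(page_url, match.group(1))


def download_pdf(page_url, destination):
    """Fetch the wrapper page, find the PDF URL, and save the PDF."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    pdf_url = extract_pdf_url(page_url, html)
    if pdf_url is None:
        raise ValueError("no window.open(...) target found in the page")
    urlretrieve(pdf_url, destination)
    return pdf_url
```

With this approach no browser is needed: download_pdf(url, fullfilename) would fetch the wrapper page first and then the PDF it points to, provided the assumption about window.open holds.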

Hi, yes, that's correct. The actual PDF URL is only available after clicking the link. - Afiq Johari, Aug 18 '19 at 14:34
Is there any way to retrieve the PDF link without clicking on the URL? It seems that I can only see the PDF link once I have clicked on it. - Afiq Johari, Aug 18 '19 at 14:40

score 0 · Answer 2 · answered Aug 18 '19 at 14:39

The question is poorly worded.

However, I finally got a working solution.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlretrieve



driver = webdriver.Chrome(executable_path='chromedriver_win32/chromedriver.exe')

# open the download link, note that download link doesn't show the pdf url yet
driver.get('http://www.ha.org.hk/visitor/ha_view_content.asp?content_id=253237&lang=ENG')

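# once the page has loaded, its script navigates the browser to the pdf,
# so the current url should now be the pdf url; without a filename argument,
# urlretrieve saves to a temporary file and returns its path
local_path, headers = urlretrieve(driver.current_url)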
