-2

I've created a script to download a file from this link. Most of the time the script only downloads the file partially, and as a result I can't see the content of that file. How can I force the script to download the file completely?

Here is the script I'm trying with:

import requests

link = 'http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link,headers=headers,stream=True)
with open('Third_Street_Plans.pdf', 'wb') as f:
    for chunk in res.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
MITHU
  • It looks like the remote server is, for lack of a better term, very wobbly. I keep getting connection timeouts. _If_ the server sends a Content-Length header, you could compare that against the number of bytes you actually end up writing (`f.tell()`), and if it differs, retry the whole shebang (see the sketch just after these comments). – AKX Aug 21 '22 at 20:40
  • Why use that user-agent? Why use stream parameter? – thebadgateway Aug 22 '22 at 11:40
  • Do you have better ideas, @thebadgateway? – MITHU Aug 22 '22 at 13:33
  • Some servers are programmed to ignore or even incorrectly serve requests with spoofed user-agent values, if it’s possible to detect. The stream parameter might be involved in using chunked encoding, which some servers might not implement as it is part of a later HTTP specification. Not trying to be critical for its own sake, just trying to help – thebadgateway Aug 22 '22 at 14:57
  • Thanks for the pointer @thebadgateway. The information related to stream is new to me. – MITHU Aug 22 '22 at 16:22
  • @SMTH: This bounty is about to expire. There is at least one valid answer to this question. What's going on? – Barry the Platipus Aug 28 '22 at 20:35
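
Here is a minimal sketch of the Content-Length check suggested above, assuming the server actually sends that header; the retry limit, timeout, and chunk size are arbitrary choices, not values from the question:

import requests

link = 'http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}

for attempt in range(5):  # arbitrary retry limit
    try:
        res = requests.get(link, headers=headers, stream=True, timeout=60)
        expected = int(res.headers.get('Content-Length', 0))
        with open('Third_Street_Plans.pdf', 'wb') as f:
            for chunk in res.iter_content(chunk_size=1024):
                f.write(chunk)
            written = f.tell()
        # Stop once the byte count matches what the server promised
        # (or when no Content-Length was sent and nothing raised).
        if not expected or written >= expected:
            break
    except requests.exceptions.RequestException:
        continue  # timed out or dropped mid-transfer; retry the whole download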

5 Answers

1

The problem here is not your laptop or your code but http://www.sidney.ca itself (the `if chunk:` test is probably unnecessary, though). When I ran my script below against the http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf URL, it did not download anything at all, and that was the first time I had ever visited that URL. I then checked it in Chrome and Firefox, and it did not work there either. Only after connecting through a VPN did it download in my browser.

Most importantly, I checked the size of the file you want to download: it's only around 5 MB. I tested the code below with a file of around 18 MB and it worked: http://www.javier8a.com/itc/bd1/articulo.pdf

import requests

link = 'http://www.javier8a.com/itc/bd1/articulo.pdf'
res = requests.get(link)
with open('the.pdf', 'wb') as f:
    f.write(res.content)  # write the full response body to disk

So you can try a VPN connection or rotating proxies to download it, as sketched below; it may or may not work, since the server has trouble serving its content.

For using a VPN from Python, look up the Python 3 and Python 2 implementations discussed on Stack Overflow.
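
If you want to route the request through a proxy rather than a system-wide VPN, requests accepts a proxies mapping. This is only a sketch; the address below is a documentation placeholder, not a working proxy:

import requests

link = 'http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf'

# Placeholder proxy address (TEST-NET range); substitute a proxy you actually control.
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

res = requests.get(link, proxies=proxies, timeout=60)
with open('Third_Street_Plans.pdf', 'wb') as f:
    f.write(res.content)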

1

My guess is that it is not a matter of your code. Have a look at what happens when I request the PDF file from a browser: the domain name is never resolved, so the resource, or the file in this case, cannot be reached:

[screenshot: browser request failing because the domain name does not resolve]

Carlos
0

Per the suggestion offered by thebadgateway, I removed the chunking and was able to download all 18 pages:

import requests

link = 'http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link, headers=headers)
with open('Third_Street_Plans.pdf', 'wb') as f:
    f.write(res.content)
Booboo
  • Thanks for your answer @Booboo. I'll let you know once I'm done testing it. – MITHU Aug 22 '22 at 16:26
  • I tried your script and yes, I still get a corrupted file. Like I said, the script can sometimes download the file flawlessly but most of the time it comes up with that strange behaviour. – MITHU Aug 22 '22 at 16:38
  • I have run this 10 times in a row and got a complete PDF each time. It even works with chunking. Perhaps you should post more details about your environment: What version of **requests** are you running? What is your platform? Are you running behind some type of special firewall? – Booboo Aug 22 '22 at 18:38
  • By the way, your test `if chunk:` is superfluous; you should not be getting 0-length chunks. – Booboo Aug 22 '22 at 18:46
  • The requests version is 2.25.1 and the OS is Windows 10. I got a partially downloaded PDF file running the script in the Sublime Text editor and Python's built-in IDE. – MITHU Aug 22 '22 at 19:27
  • I am running requests 2.26 (you can try updating to the latest) on Windows, too. Also try running the script from the command line, i.e. a Command Prompt window. I should have asked what Python version you are running. – Booboo Aug 22 '22 at 21:48
  • Upgrading to requests version `2.26.0` didn't really help. I'm now getting the same result I was getting earlier with requests 2.25.1. – MITHU Aug 23 '22 at 04:13
  • Did you try running this from a Command Prompt? What version of Python are you running? Are you behind some corporate firewall that could be causing the problem? – Booboo Aug 23 '22 at 09:28
  • I’m still wondering if the resource server has your IP address blacklisted and is not fulfilling your requests. This does happen. Ways to verify this include running the code on a different network or, if you can, relaying the request through a proxy. – thebadgateway Aug 23 '22 at 16:05
  • I'm using Python 3.9.10. I ran the script from a Command Prompt as well, but no luck; I'm still getting a partially downloaded PDF file. I run the script from my home, @Booboo. – MITHU Aug 23 '22 at 16:34
  • I used proxy rotation within a script to get the file downloaded but experienced the same behaviour. I even activated a VPN and ran the script from a different location, with no success, @thebadgateway. – MITHU Aug 23 '22 at 16:37
0

The following code will download that file (and should handle a temperamental server too):

from httpx import stream

with stream("GET", "http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf", timeout=600) as r:
    with open('third_street_plans_correct.pdf', 'wb') as f:
        for data in r.iter_bytes():
            f.write(data)

HTTPX docs: https://www.python-httpx.org/
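
If the connection itself keeps dropping, httpx can also retry failed connection attempts through a transport. The sketch below is a variation on the code above with an arbitrary retry count; note that these retries only cover establishing the connection, not errors that occur mid-download:

import httpx

# Retry up to 3 times when the initial connection fails (connect errors only).
transport = httpx.HTTPTransport(retries=3)

with httpx.Client(transport=transport, timeout=600) as client:
    with client.stream("GET", "http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf") as r:
        with open('third_street_plans_correct.pdf', 'wb') as f:
            for data in r.iter_bytes():
                f.write(data)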

Barry the Platipus
0

I think the problem is in the link. Your script is correct. Try this one:

import requests

link = "http://www.sidney.ca/Assets/Active+Development+Applications/2021/9633_Third_Street_Plans.pdf"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link, headers=headers, stream=True)
with open("file.pdf", "wb") as f:
    # Write the raw bytes; decoding to text would corrupt the PDF.
    f.write(res.content)