I have a weird error. There's a file on Dropbox that I'm downloading with the following Python code:

import requests
import shutil

url = 'https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0'
r = requests.get(url, stream=True)
path_to_save = "/tmp/data.dload-1"
with open(path_to_save, 'wb') as f:
    shutil.copyfileobj(r.raw, f)  

This downloads to /tmp/data.dload-1.

I also downloaded the same file with wget:

wget https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0 -O /tmp/data.dload-2

These two files have the same type:

(dl)x:x$ file /tmp/data.dload-1 
/tmp/data.dload-1: gzip compressed data, from Unix
(dl)x:x$ file /tmp/data.dload-2 
/tmp/data.dload-2: gzip compressed data, last modified: Thu Apr 26 23:05:15 2018, from Unix

But untarring them produces different results:

(dl)x:x$ tar -zxvf /tmp/data.dload-1
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
(dl) x:x$ tar -zxvf /tmp/data.dload-2
testfiles/a
testfiles/b
(dl)x:x$ 

Does anybody have any idea why this might happen and, more importantly, how I can download that tar file with Python (preferably with requests)?

This is the result from r.headers: (dl) x:x$ python dload-test.py {'Server': 'nginx', 'Date': 'Fri, 27 Apr 2018 17:27:06 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Cache-Control': 'no-cache', 'Content-Security-Policy': "script-src 'unsafe-eval' https://www.dropbox.com/static/compiled/js/ https://www.dropbox.com/static/javascript/ https://www.dropbox.com/static/api/ https://cfl.dropboxstatic.com/static/compiled/js/ https://www.dropboxstatic.com/static/compiled/js/ https://cfl.dropboxstatic.com/static/js/ https://www.dropboxstatic.com/static/js/ https://cfl.dropboxstatic.com/static/previews/ https://www.dropboxstatic.com/static/previews/ https://cfl.dropboxstatic.com/static/api/ https://www.dropboxstatic.com/static/api/ https://cfl.dropboxstatic.com/static/cms/ https://www.dropboxstatic.com/static/cms/ https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ 'unsafe-inline' ; img-src https://* data: blob: ; frame-ancestors 'self' ; default-src 'none' ; frame-src https://* carousel://* dbapi-6://* dbapi-7://* dbapi-8://* itms-apps://* itms-appss://* ; worker-src https://www.dropbox.com/static/serviceworker/ blob: ; style-src https://* 'unsafe-inline' 'unsafe-eval' ; connect-src https://* ws://127.0.0.1:*/ws ; object-src 'self' https://cfl.dropboxstatic.com/static/ https://www.dropboxstatic.com/static/ https://flash.dropboxstatic.com https://swf.dropboxstatic.com https://dbxlocal.dropboxstatic.com ; media-src https://* blob: ; font-src https://* data: ; child-src https://www.dropbox.com/static/serviceworker/ blob: ; form-action 'self' https://www.dropbox.com/ https://dl-web.dropbox.com/ https://photos.dropbox.com/ https://accounts.google.com/ https://api.login.yahoo.com/ https://login.yahoo.com/ ; base-uri 'self' api-stream.dropbox.com showbox-tr.dropbox.com ; report-uri https://www.dropbox.com/csp_log", 'Dropbox-Streaming': 'V=1', 'Pragma': 'no-cache', 
'Referrer-Policy': 'origin-when-cross-origin', 'Set-Cookie': 'locale=en; Domain=dropbox.com; expires=Wed, 26 Apr 2023 17:27:06 GMT; Path=/; secure, gvc=OTU0NjExNzUwNjc0NjQxNzgwMzE0OTgzMzkzNjc3MzM5OTYzNzc%3D; expires=Wed, 26 Apr 2023 17:27:06 GMT; httponly; Path=/; secure, flash=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, puc=; expires=Fri, 27 Apr 2018 17:27:06 GMT; httponly; Path=/; secure, bang=; Domain=dropbox.com; expires=Fri, 27 Apr 2018 17:27:06 GMT; Path=/; secure, seen-sl-signup-modal=VHJ1ZQ%3D%3D; expires=Sun, 27 May 2018 17:27:06 GMT; httponly; Path=/; secure, t=HlsAKcFI_HJWteio0_5ELyFf; Domain=dropbox.com; expires=Mon, 26 Apr 2021 17:27:06 GMT; httponly; Path=/; secure, __Host-js_csrf=HlsAKcFI_HJWteio0_5ELyFf; expires=Mon, 26 Apr 2021 17:27:06 GMT; Path=/; secure', 'X-Content-Type-Options': 'nosniff', 'X-Dropbox-Request-Id': 'b028e94ce7b814c7f25fb753449b641a', 'X-Frame-Options': 'DENY', 'X-Robots-Tag': 'noindex, nofollow, noimageindex', 'X-Xss-Protection': '1; mode=block', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Content-Encoding': 'gzip'}

src
  • See http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content – chepner Apr 27 '18 at 17:20
  • You might also want to look at a diff of the hexdumps of each file, to see if there are lots of differences or one tiny difference that is enough to confuse `tar`. – chepner Apr 27 '18 at 17:22
  • Print out `r.headers` (and post the result here). – abarnert Apr 27 '18 at 17:25
  • @abarnert updated. – src Apr 27 '18 at 17:29
  • @src Wow, that's a lot more headers than I expected. But the very last one is the one I was looking for confirmation of; I updated my answer to match your updated question. (You may also want to look at `r.request.headers`, as I mentioned in the answer, but you don't necessarily need to update the question with that unless it's different than I expected or you don't understand why it's that way.) – abarnert Apr 27 '18 at 17:59

2 Answers

The problem is that the response is being gzip-compressed in transit, even though the file itself is already gzipped (as can be seen from the 'Content-Encoding': 'gzip' field in r.headers).

You're using the default request headers, for both requests and wget. Both of them will, by default, send something like 'Accept-Encoding: gzip, deflate'. (You can see this if you print out r.request.headers.) So the server can easily gzip the file and send it back with a 'Content-Encoding: gzip' header.

Both wget and requests will, by default, detect that header and transparently decode the data for you. But by copying from r.raw, you've explicitly told requests not to do that and to hand you the raw data as-is.
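For contrast with reading r.raw, here is a minimal sketch of the streaming pattern that keeps the transparent decoding. It relies on Response.iter_content(), which undoes a gzip or deflate Content-Encoding; save_decoded and chunk_size=8192 are names and values made up for this illustration, not something from the question.

import shutil is not needed for this variant, since the chunks are written directly:

```python
def save_decoded(response, path, chunk_size=8192):
    """Write the body of a streamed requests.Response to path.

    Unlike response.raw, iter_content() yields the body with any
    gzip/deflate Content-Encoding already removed.
    """
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return path
```

It would be called as save_decoded(requests.get(url, stream=True), "/tmp/data.dload-1"), assuming the server is sending the file itself.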

So you end up saving a file which is a gzip-compressed-gzip-compressed-tarball. Obviously, file will report that as gzip compressed data, and tar -z will report that what's inside the gzip does not look like a tar archive, because it isn't, it's a gzipped tar archive.
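The double-gzip effect described above can be reproduced offline with only the standard library. This is a sketch with made-up data (a one-file tarball containing testfiles/a); make_tar_gz and list_members are hypothetical helper names, and gzip.compress stands in for whatever the server does when it applies Content-Encoding: gzip.

```python
import gzip
import io
import tarfile


def make_tar_gz():
    """Build a tiny .tar.gz in memory, standing in for testfiles.tar.gz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo(name="testfiles/a")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_members(blob):
    """Do what `tar -z` does: gunzip once, then read the result as a tar."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        return tar.getnames()


tarball = make_tar_gz()
on_the_wire = gzip.compress(tarball)  # the extra layer added in transit

# Both blobs start with the gzip magic bytes, which is all `file` looks at.
print(tarball[:2] == on_the_wire[:2] == b"\x1f\x8b")

try:
    list_members(on_the_wire)  # one gunzip leaves a .tar.gz, not a tar
except tarfile.ReadError as exc:
    print("raw bytes are not a tarball:", exc)

# Removing the transfer layer first gives back the original archive.
print(list_members(gzip.decompress(on_the_wire)))
```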

The smallest fix here is to manually add headers={'Accept-Encoding': 'identity'} to your request.
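Applied to the code in the question, that fix might look roughly like this. It's a sketch, not a tested recipe for Dropbox: download_raw and the optional session parameter are names invented here, and it assumes the server honors Accept-Encoding: identity.

```python
import shutil

import requests


def download_raw(url, path, session=None):
    """Stream url to path without asking for a compressed transfer."""
    session = session or requests.Session()
    # 'identity' asks the server not to apply any Content-Encoding, so the
    # bytes read from r.raw should be the bytes of the file itself.
    headers = {"Accept-Encoding": "identity"}
    with session.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f)
    return path
```

The per-request header overrides the session's default Accept-Encoding, so nothing else about the original code needs to change.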


You may wonder why the server is bothering to gzip-compress a gzipped file. Just because you told it you can accept gzip doesn't mean you're demanding gzip, right?

If you look at RFC 2616 and RFC 7231, the server is supposed to pick, from the encodings it can support, the one the client gave the highest qvalue (weight), breaking ties according to some heuristic that isn't specified. If your user agent explicitly asks for 'gzip, deflate', giving you identity would be incorrect unless gzip were actually impossible, not merely a bit silly.

abarnert
This is crazy, but changing the 0 at the end of the URL to 1 (that is, ?dl=1 instead of ?dl=0) works. I got the idea from this SO post.
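If the URL comes from somewhere else, the change can be made in code instead of by hand. A small sketch using only the standard library; force_direct_download is a hypothetical helper name, and it only rewrites the query string (collapsing any repeated parameters), it doesn't download anything.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_direct_download(url):
    """Return url with its dl query parameter set to 1."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["dl"] = "1"  # replaces dl=0, or adds dl=1 if it was missing
    return urlunsplit(parts._replace(query=urlencode(query)))


url = "https://www.dropbox.com/s/fgyso9fq40qp1vl/testfiles.tar.gz?dl=0"
print(force_direct_download(url))
```

The result can then be passed to requests.get in place of the original URL.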

src