
I'm trying to programmatically download a file from a pre-signed S3 URL. I know that the file I'm downloading is an ASCII text file. When I download it by pasting the URL into Chrome, the file is indeed what I expect (see below). However, with wget the downloaded file is binary.

Looking into previous posts about this, I unfortunately couldn't find much that helped. Those posts suggest adding quotes around the URL, but my URL does not contain special characters. Some of the posts I checked: Amazon AWS S3 signed URL via Wget, https://superuser.com/questions/1311516/curl-can-not-download-file-but-browser-can. (I double-checked anyway with both double and single quotes; neither worked in my case.)

➜  wget --no-check-certificate --no-proxy  "https://s3.eu-central-1.amazonaws.com/.../text_file.txt"
--2022-07-28 10:49:57--  https://s3.eu-central-1.amazonaws.com/.../text_file.txt
Resolving s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)... 52.219.75.159
Connecting to s3.eu-central-1.amazonaws.com (s3.eu-central-1.amazonaws.com)|52.219.75.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21110 (21K) [binary/octet-stream]
Saving to: ‘text_file.txt’

text_file.txt                                     100%[===========================================================================================================>]  20.62K  --.-KB/s    in 0.004s  

2022-07-28 10:49:57 (5.61 MB/s) - ‘text_file.txt’ saved [21110/21110]

➜  file text_file.txt                                                                                                                                           
text_file.txt: data
➜  cat text_file.txt | head -n 1
[78!???ÊBz?j????X?????x>??_uߩi??a?Qqax?W?ϴ??_c????H???u?c??}???U??5?M?|A?-9?H?Y??\?՟??B?l
2ɯL????:?JZF㽬???,2?gn????Y~vU?l4?O`?!???r                                               ?h?1?]??f???
                                          ?MIUM??_??q?u?dC???v?MbcI>?R??oV???&?
# Following lines are for a file downloaded by copy-paste of the URL to a Chrome window
➜  file text_file\ \(1\).txt 
text_file (1).txt: ASCII text
➜  cat text_file\ \(1\).txt| head -n 1 
# Header of file
  • If you add `-d` to the wget command, what does the `---response begin---` section show? – Anon Coward Jul 28 '22 at 16:01
  • Here is the response: ```---request begin--- GET .../text_file.txt HTTP/1.1 Host: s3.eu-central-1.amazonaws.com User-Agent: Wget/1.21.3 Accept: */* Accept-Encoding: identity Connection: Keep-Alive ``` This all looks fine to me? – user2416984 Jul 29 '22 at 07:29
  • What does the response look like, not the request? – Anon Coward Jul 29 '22 at 13:08
  • Sorry, here it is: ```---response begin--- HTTP/1.1 200 OK x-amz-id-2: eMcXc3+kNlyc0gPPHTeM61BJpmr1yzaG7NVfQQwTTC9eOINMt7ZPTqogJGNjnH1ITnTUhFWcEd8= x-amz-request-id: 0G7QWAMZ7DTH19Q5 Date: Sun, 31 Jul 2022 12:41:47 GMT Last-Modified: Thu, 28 Jul 2022 07:39:30 GMT ETag: "75e8f5916eedb2237a27bdb609248b03" Content-Encoding: br Accept-Ranges: bytes Content-Type: binary/octet-stream Server: AmazonS3 Content-Length: 21110``` – user2416984 Jul 31 '22 at 12:43
  • Whatever uploaded that file to S3 compressed it along the way. Browsers will react to the Content-Encoding and decompress it on the fly, but wget will not. You either need to upload it without Brotli compression, or decompress it after download. – Anon Coward Jul 31 '22 at 19:55
  • That's it, I should have looked at the encoding. The file produced by ```wget 'https://s3.eu-central-1.amazonaws.com/.../text_file.txt' && brotli -d -o text_file2.txt text_file.txt``` is indeed ASCII text. Feel free to post this as an answer and I'll accept it. – user2416984 Aug 01 '22 at 07:03

1 Answer


The content you have is likely stored compressed in S3. When a file is compressed with a common scheme like gzip, Brotli, LZW, or zlib, and is marked with the appropriate Content-Encoding, most browsers will decompress the file on the fly, either for display or download.
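
For what it's worth, here's a minimal sketch of handling this generically: check the Content-Encoding header and pick the matching decompressor. The header values are standard; the choice of curl and of gunzip/brotli as the tools is my assumption:

# sketch.sh - decide how to decompress based on Content-Encoding
url="https://example-bucket.s3.amazonaws.com/example_html_br.html"

# -sI sends a silent HEAD request; pull out the Content-Encoding value, if any
encoding=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-encoding:" { print $2 }')

case "$encoding" in
  gzip) curl -s "$url" | gunzip    > out.html ;;
  br)   curl -s "$url" | brotli -d > out.html ;;
  *)    curl -s "$url"             > out.html ;;  # no (recognized) encoding
esac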

For instance, if we upload a simple HTML file, but compress it:

$ cat example_file.html | brotli | \
    aws s3 cp - s3://example-bucket/example_html_br.html \
    --acl=public-read --content-encoding br
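
If you have access to the bucket, you can confirm how the object is stored with a head-object call (same hypothetical bucket and key as above; output abbreviated):

$ aws s3api head-object --bucket example-bucket --key example_html_br.html
{
    "ContentLength": 67,
    "ContentEncoding": "br",
    "ContentType": "binary/octet-stream",
    [...]
}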

Then we can view the contents in the browser, because the browser engine decompresses the file on the fly:

[Screenshot: browser showing the rendered HTML file]

But attempting to download the file with wget shows the compressed contents:

$ wget -qO- https://example-bucket.s3.amazonaws.com/example_html_br.html | hexdump -C
00000000  1f 6e 00 00 1d 07 ee be  1d 1b 46 77 12 aa 15 78  |.n........Fw...x|
00000010  a8 dc d4 d4 5b 83 cc a0  a5 81 96 1c b0 b7 d5 6d  |....[..........m|
00000020  29 46 f6 fa 6e 63 eb 29  ea aa 82 c8 25 a8 42 91  |)F..nc.)....%.B.|
00000030  ce 1d 07 f6 06 e1 52 0f  f4 4a a9 d6 87 17 76 ff  |......R..J....v.|
00000040  e1 da 01                                          |...|
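
As a side note, unlike gzip, a raw Brotli stream has no magic bytes, so file can't identify it and reports generic data, just like the file in the question:

$ wget -qO example.br https://example-bucket.s3.amazonaws.com/example_html_br.html
$ file example.br
example.br: data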

You can verify this by looking at the HTTP headers:

$ wget -S https://example-bucket.s3.amazonaws.com/example_html_br.html
--2022-08-01 14:10:40--  https://example-bucket.s3.amazonaws.com/example_html_br.html
Resolving example-bucket.s3.amazonaws.com (example-bucket.s3.amazonaws.com)... 52.218.178.75
  [...]
  HTTP/1.1 200 OK
  Content-Encoding: br

This is the Content-Encoding header that the browser triggers off of. Either you'll need to ensure that whatever component places this content in S3 in the first place doesn't compress it, or, if you want to download the content yourself, you'll need to decompress it after download as the browser does:

$ wget -qO- https://example-bucket.s3.amazonaws.com/example_html_br.html | brotli -df
<html>
<head>
<title>Example</title>
[...]
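
The first option, re-uploading the object uncompressed so wget receives plain text directly, would look something like this (same hypothetical names; --content-type is optional but avoids the binary/octet-stream default you get when piping from stdin):

$ aws s3 cp example_file.html s3://example-bucket/example_html.html \
    --acl public-read --content-type text/html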

The same premise holds true if you're using pre-signed URLs.
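
For example, generating a pre-signed URL with the CLI and downloading through it shows the same behavior (aws s3 presign is the real command; the bucket and key are the hypothetical ones from above, and the quotes matter here since the signed URL contains & characters):

$ url=$(aws s3 presign s3://example-bucket/example_html_br.html --expires-in 300)
$ wget -qO- "$url" | brotli -df | head -n 3
<html>
<head>
<title>Example</title>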

Anon Coward