How to prevent GCS from automatically decompressing objects when using Python SDK?

Question

I'm trying to download an object in GCS that is compressed, but I'm unable to download it without GCS automatically decompressing the file for me. I want to be able to download the gzip myself, and then decompress locally.

If I open the object in the GCS console, I can view the object metadata and see the following:

Content-Type: application/json
Content-Encoding: gzip
Cache-Control: no-transform

Also, if I right-click the Authenticated URL in the console and click Save Link As, I get a gzip archive, so I know that the stored object really is a gzip archive.

I read in the GCS documentation that if you set Cache-Control: no-transform, then "the object is served as a compressed object in all subsequent requests".

However, when I use the code below to download the GCS object, it is downloaded as plain JSON, not as a gzip archive:

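from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucketname")
blob = bucket.blob("objectname")
# Each of the three calls below returns decompressed JSON, not the gzip archive.
stringobj = blob.download_as_text()
bytesobj = blob.download_as_bytes()
blob.download_to_filename("test.json.gz")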

I've tried three different methods for downloading the object, and all of them return the decompressed JSON.

To confirm that the object does in fact have the correct metadata, I ran the following:

blob.reload()
print(f"Content encoding: {blob.content_encoding}")
print(f"Content type: {blob.content_type}")
print(f"Cache control: {blob.cache_control}")

>> Content encoding: gzip
>> Content type: application/json
>> Cache control: no-transform

I'm not sure what else I could try.

Your problem is `Content-Encoding`: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding You can delete that header and instead apply `Content-Type: application/gzip`. — John Hanley, May 28 '21 at 21:06
@JohnHanley I'm not sure I agree that this is actually a problem. Per Google's documentation, the way I have it set up `gives the most information about the state of the object to anyone accessing it. Doing so also makes the object eligible for decompressive transcoding when it is later downloaded, allowing client applications to handle the semantics of the Content-Type correctly.` It also says `If the Cache-Control metadata field for the object is set to no-transform, the object is served as a compressed object in all subsequent requests..` — Kyle, May 28 '21 at 21:33
So according to their documentation having `no-transform` set on the `Cache-Control` header should prevent decompressive transcoding from happening. — Kyle, May 28 '21 at 21:34
For example, a web server, proxy, etc can process a file for a client. The server can apply compression, so it adds the header `Content-Encoding` meaning it transformed the original object. The client (browser) now knows to untransform the object. In your case, decompress it. `The Content-Encoding representation header lists any encodings that have been applied to the representation (message payload), and in what order. This lets the recipient know how to decode the representation in order to obtain the original payload format` — John Hanley, May 28 '21 at 21:52

Donnald Cucharo · Accepted Answer · 2021-05-31T02:09:45.303

I reproduced your problem. Following your steps, I got the same behavior: the download was saved with a .gz extension, but the content was not actually gzip. Running gunzip on the file returns an error:

Example.json.gz: not in gzip format
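The same check can be done from Python without gunzip. This is only a minimal sketch: it relies on the fact that gzip streams begin with the two magic bytes 0x1f 0x8b, and the helper names `is_gzip_bytes` and `is_gzip_file` are made up for illustration, not part of any library.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_bytes(data):
    """Return True if the payload starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return is_gzip_bytes(f.read(2))
```

A file that was transparently decompressed on download, such as the one above, would make `is_gzip_file("Example.json.gz")` return False even though its name ends in .gz.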

The solution is to pass raw_download=True, which downloads the raw gzip archive and prevents decompressive transcoding from happening.

Example:

blob.download_to_filename("test.json.gz", raw_download=True)
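If you want the bytes in memory rather than in a file, `download_as_bytes` accepts the same `raw_download` argument, and you can then decompress locally with the standard `gzip` module. The sketch below assumes the object holds gzip-compressed UTF-8 JSON, as in the question; the function name `download_gzip_and_decode` is hypothetical.

```python
import gzip
import json


def download_gzip_and_decode(blob):
    """Fetch the stored (still compressed) bytes, then decompress locally.

    `blob` is expected to be a google.cloud.storage Blob, or anything with
    a compatible download_as_bytes(raw_download=...) method.
    """
    # raw_download=True asks the client to skip decompressive transcoding.
    raw = blob.download_as_bytes(raw_download=True)
    # Gzip streams start with the magic bytes 0x1f 0x8b; fail loudly otherwise.
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("downloaded payload is not gzip data")
    # Assumption: the decompressed content is UTF-8 encoded JSON.
    return json.loads(gzip.decompress(raw).decode("utf-8"))


# Usage against a real bucket (names taken from the question):
# from google.cloud import storage
# blob = storage.Client().bucket("bucketname").blob("objectname")
# data = download_gzip_and_decode(blob)
```

The magic-byte check is there so that a silently decompressed download raises an error instead of being passed to `gzip.decompress`.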
