
I want to download a file of over 20 GB from the internet directly into a Google Cloud Storage bucket, just like running the following on a local command line:

wget http://some.url.com/some/file.tar 

I refuse to download the file to my own computer first and then copy it to the bucket using:

gsutil cp file.tar gs://the-bucket/

At the moment I am trying to use Datalab to download the file and then copy it from there to the bucket.

ericcco
bsaldivar
  • Does this answer your question? [Google cloud storage - Download file from web](https://stackoverflow.com/questions/28749589/google-cloud-storage-download-file-from-web) – kubanczyk Nov 03 '20 at 09:43

4 Answers


Google Cloud Platform provides a capability for Google Cloud Storage known as the "Storage Transfer Service", which has its own documentation.

At the highest level, this capability allows you to define a source of data that is external to Google, such as data available at a URL or on AWS S3 storage, and then schedule it to be copied to Google Cloud Storage in the background. This seems to perform exactly the task you want: the data is copied from an Internet source directly to GCS.
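For an HTTP(S) source, the Storage Transfer Service reads a tab-separated URL list that begins with a TsvHttpData-1.0 header, where each entry gives the URL, the size in bytes, and the base64-encoded MD5 of the file. A minimal sketch (the size and MD5 values below are placeholders; you must supply the real ones for your file):

```
TsvHttpData-1.0
http://some.url.com/some/file.tar	21474836480	9EgyN/xOCxxFil2hj1qLPg==
```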


A completely different story would be the realization that GCP itself provides compute capabilities. What this means is that you can run your own logic on GCP through simple mechanisms such as a VM, Cloud Functions, or Cloud Run. You could therefore execute code from within GCP that downloads the Internet-based data to a local temp file, and then upload that file into GCS, also from within GCP. At no time would the data that ends up in GCP ever go anywhere other than from the source to Google. Once retrieved from the source, the transfer of the data from the GCP compute to GCS storage should be optimal, as it passes exclusively over Google's internal ultra-high-speed networks.
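A minimal sketch of that second story in Python, assuming the google-cloud-storage client library is installed on the VM (the URL, bucket, and blob names are placeholders):

```python
import shutil
import tempfile
import urllib.request


def stream_copy(src, dst, chunk_size=1024 * 1024):
    """Copy a file-like object in fixed-size chunks, so a 20 GB
    download never has to fit in memory at once."""
    shutil.copyfileobj(src, dst, chunk_size)


def download_to_temp(url):
    """Stream `url` into a named temporary file and return its path."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with urllib.request.urlopen(url) as resp:
        stream_copy(resp, tmp)
    tmp.close()
    return tmp.name


def upload_to_gcs(path, bucket_name, blob_name):
    # Requires the google-cloud-storage package; when run inside GCP
    # it authenticates with the VM's service-account credentials.
    from google.cloud import storage
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(path)
```

Because both the download and the upload happen on the GCP side, only the source-to-Google hop ever leaves Google's network.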

Kolban
  • Note that you need to provide the base64-encoded MD5 hash of the object in the URL list. If you don't know it beforehand you'd still need to download the file. – LundinCast Jul 24 '19 at 16:05
  • Thanks. But, do you know any way to do this from the command line? In addition, I would like to download data from kaggle, which doesn't provide any static url and I need to call the command of kaggle to download the dataset, so **Storage Transfer** doesn't help me here. – bsaldivar Jul 25 '19 at 07:21
  • I've updated the answer with a second potential story for our consideration. – Kolban Jul 25 '19 at 14:42

You can run the curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file command from inside Cloud Shell on GCP. That way the data never touches your own network and stays entirely within GCP.

dlenehan

For large files, one-liners will very often fail, as will the Google Storage Transfer Service. Part two of Kolban's answer is then needed, and I thought I'd add a little more detail, since it can take time to figure out the easiest way of actually downloading to a Google Compute Engine instance and uploading to a bucket.

For many users, I believe the easiest way will be to open a notebook from the Google AI Platform and do the following:

%pip install wget
import wget
from google.cloud import storage       # Pre-installed on AI Platform notebooks

# Download the source file onto the notebook instance's local disk
wget.download('source_url', 'temp_file_name')

# Upload the local file to the target bucket
client = storage.Client()
bucket = client.get_bucket('target_bucket')
blob = bucket.blob('upload_name')
blob.upload_from_filename('temp_file_name')

There is no need to set up an environment, you get the convenience of notebooks, and the client will have automatic access to your bucket if the notebook is hosted under the same GCP account.

Dharman
anton schwarz

I found a similar post which explains that you can download a file from the web and copy it to your bucket in a single command line:

curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file.tar

I tried it with my own bucket and it works correctly, so I hope this is what you were expecting.

ericcco
  • Thanks for the reply. However, it seems that from your approach you are still downloading to your local computer and then... copying the output (streaming?) to the bucket. – bsaldivar Jul 25 '19 at 09:07