
I want to download a file of over 20 GB from the internet directly into a Google Cloud Storage bucket, just like running the following on a local command line:

wget http://some.url.com/some/file.tar 

I refuse to download the file to my own computer first and then copy it to the bucket using:

gsutil cp file.tar gs://the-bucket/

At the moment I am trying to use Datalab to download the file and then copy it from there to the bucket.

ericcco
bsaldivar
  • Does this answer your question? [Google cloud storage - Download file from web](https://stackoverflow.com/questions/28749589/google-cloud-storage-download-file-from-web) – kubanczyk Nov 03 '20 at 09:43

4 Answers


Google Cloud Platform provides a capability for Google Cloud Storage known as the "Storage Transfer Service", which has its own documentation.

At the highest level, this capability allows you to define a source of data that is external to Google, such as data available at a URL or on AWS S3 storage, and then schedule it to be copied to Google Cloud Storage in the background. This seems to perform exactly the task you want: the data is copied from an Internet source directly to GCS.
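For an HTTP(S) source, the Storage Transfer Service reads a tab-separated URL list that begins with a TsvHttpData-1.0 header, where each entry gives the URL, the size in bytes, and the base64-encoded MD5 of the file. A minimal sketch (the size and MD5 values below are placeholders; you must supply the real ones for your file):

```
TsvHttpData-1.0
http://some.url.com/some/file.tar	21474836480	9EgyN/xOCxxFil2hj1qLPg==
```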


A completely different story would be the realization that GCP itself provides compute capabilities. What this means is that you can run your own logic on GCP through simple mechanisms such as a VM, Cloud Functions, or Cloud Run. You could therefore execute code from within GCP that downloads the Internet-based data to a local temp file, and then upload that file into GCS, also from within GCP. At no time would the data that ends up in GCP ever go anywhere other than from the source to Google. Once retrieved from the source, the transfer of the data from the GCP compute to GCS storage should be optimal, as it passes exclusively over Google's internal ultra-high-speed networks.
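A minimal sketch of that second story in Python, assuming the google-cloud-storage client library is installed on the VM (the URL, bucket, and blob names are placeholders):

```python
import shutil
import tempfile
import urllib.request


def stream_copy(src, dst, chunk_size=1024 * 1024):
    """Copy a file-like object in fixed-size chunks, so a 20 GB
    download never has to fit in memory at once."""
    shutil.copyfileobj(src, dst, chunk_size)


def download_to_temp(url):
    """Stream `url` into a named temporary file and return its path."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    with urllib.request.urlopen(url) as resp:
        stream_copy(resp, tmp)
    tmp.close()
    return tmp.name


def upload_to_gcs(path, bucket_name, blob_name):
    # Requires the google-cloud-storage package; when run inside GCP
    # it authenticates with the VM's service-account credentials.
    from google.cloud import storage
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(path)
```

Because both the download and the upload happen on the GCP side, only the source-to-Google hop ever leaves Google's network.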

Kolban
  • Note that you need to provide the base64-encoded MD5 hash of the object in the URL list. If you don't know it beforehand you'd still need to download the file. – LundinCast Jul 24 '19 at 16:05
  • Thanks. But, do you know any way to do this from the command line? In addition, I would like to download data from kaggle, which doesn't provide any static url and I need to call the command of kaggle to download the dataset, so **Storage Transfer** doesn't help me here. – bsaldivar Jul 25 '19 at 07:21
  • I've updated the answer with a second potential story for our consideration. – Kolban Jul 25 '19 at 14:42

You can run the curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file command from inside Cloud Shell on GCP. That way the data never touches your own network and stays entirely within GCP.

dlenehan

For large files, one-liners will very often fail, as will the Google Storage Transfer Service. Part two of Kolban's answer is then needed, and I thought I'd add a little more detail, since it can take time to figure out the easiest way of actually downloading to a Google Compute Engine instance and uploading to a bucket.

For many users, I believe the easiest way will be to open a notebook from the Google AI Platform and do the following:

%pip install wget
import wget
from google.cloud import storage       # Pre-installed on AI Platform notebooks

# Download the source file onto the notebook instance's local disk
wget.download('source_url', 'temp_file_name')

# Upload the local file to the target bucket
client = storage.Client()
bucket = client.get_bucket('target_bucket')
blob = bucket.blob('upload_name')
blob.upload_from_filename('temp_file_name')

There is no need to set up an environment, you get the convenience of notebooks, and the client will have automatic access to your bucket if the notebook is hosted under the same GCP account.

Dharman
anton schwarz

I found a similar post which explains that you can download a file from the web and copy it to your bucket in a single command line:

curl http://some.url.com/some/file.tar | gsutil cp - gs://YOUR_BUCKET_NAME/file.tar

I tried it with my own bucket and it works correctly, so I hope this is what you were expecting.

ericcco
  • Thanks for the reply. However, it seems that from your approach you are still downloading to your local computer and then... copying the output (streaming?) to the bucket. – bsaldivar Jul 25 '19 at 09:07