
I've been at this for 2 days and cannot figure this out. This is my first foray into the world of Google Cloud Platform. I would like to do the following two things in the easiest way possible:

Allow me to provide some context before I start: I would like to pull data from:

the NYC Taxi and Limousine Commission trip record data: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

I am interested in pulling all the months from 2019 (12 CSV files, 10+ GB in total). I would like to bring these 12 CSV files into Google Cloud Platform, specifically into Google Cloud Storage, for later processing in BigQuery / Dataprep. Storing these files locally is not an option due to their size.

First thing - Is there a way to bring in these files using the wget command, but for multiple .csv files? From my research, wget seems better suited to pulling a single zip file or a single (or a few) individual files. The data on the taxi website is not zipped but is available by month in CSV format.

Second thing - In case wget is not a scalable option, I created a Python script to pull the data from the website mentioned above for those 12 months of 2019. The script I am using is as follows:

import requests
import shutil

filenames = [ # probably could loop through these .._YYYY-MM
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv',
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv',
]

# stream each month to a local file named after its date suffix (e.g. 2019-01.csv)
for x in filenames:
    r = requests.get(x, stream=True)
    if r.status_code == 200:
        with open(x.split('_')[-1], 'wb') as f:
            r.raw.decode_content = True  # transparently decompress if the server gzips the response
            shutil.copyfileobj(r.raw, f)

How do I use this Python script inside Google Cloud Platform to 'import' these 12 CSV files into Google Cloud Storage for use in BigQuery and Dataprep? I tried using the SSH command line and activating conda, but still couldn't figure this out (even after reviewing the Google Cloud Platform user / reference guides!).
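
From what I've read, the upload step might use the google-cloud-storage client library. Below is a rough sketch of what I'm imagining (the bucket name is just a placeholder, and I'm not sure this is the right approach):

# Rough sketch only: assumes `pip install google-cloud-storage` and that the
# environment (e.g. a Compute Engine VM) already has credentials that can
# write to the bucket. 'my-taxi-bucket' is a placeholder bucket name.
import os

import requests
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-taxi-bucket')  # placeholder

for month in range(1, 13):
    name = f'yellow_tripdata_2019-{month:02d}.csv'
    url = f'https://s3.amazonaws.com/nyc-tlc/trip+data/{name}'
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)
    bucket.blob(name).upload_from_filename(name)  # copy into Cloud Storage
    os.remove(name)  # free local disk before the next month

The idea is that only one month's file would sit on disk at a time, since I can't keep all 12 locally.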

Perhaps someone may be able to assist me. Thank you for any help you can provide! :)

tlazas912
  • This thread shows how to mount a storage bucket using gcsfuse and then run wget (in your case, you can put the URLs in a text file and then run [`wget -i`](https://linuxize.com/post/wget-command-examples/#downloading-multiple-files)): https://stackoverflow.com/questions/28749589/google-cloud-storage-download-file-from-web – Donnald Cucharo Mar 16 '21 at 05:08
  • You can load gzipped data into BigQuery, and I bet these CSVs compress very well. – David Ehrmann Mar 16 '21 at 05:12
  • Fantastic! I appreciate your suggestions and will try this. – tlazas912 Mar 16 '21 at 16:02
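
Following up on the BigQuery comment above: once the files are in Cloud Storage, a minimal sketch of the load step using the google-cloud-bigquery client might look roughly like the below (the bucket, dataset, and table names are placeholders, and the dataset is assumed to already exist):

# Sketch only: assumes `pip install google-cloud-bigquery`, an existing dataset,
# and placeholder bucket / dataset / table names.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row in each file
    autodetect=True,      # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    'gs://my-taxi-bucket/yellow_tripdata_2019-*.csv',  # wildcard over the 12 monthly files
    'my_dataset.yellow_tripdata_2019',                 # placeholder dataset.table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

The same kind of load also accepts gzip-compressed CSVs, as suggested in the comment.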

1 Answer


For a one-time load of these files into Cloud Storage, I'd recommend creating a Compute Engine VM, downloading the files there, and then uploading them with gsutil. Compute Engine VMs have high-bandwidth connections and should be able to download/upload 10GB without trouble.

  1. Create a Compute Engine VM and SSH into it. You could also use Cloud Shell, but it limits you to 5GB of storage, so you'd need to chunk the downloads / uploads.
  2. Download the files. Below is an example using curl, but you could also do it with wget or in Python using the requests library.
curl "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-[01-12].csv" -o "tripdata_2019-#1.csv"
  3. Upload with gsutil
gsutil cp tripdata_2019-{01..12}.csv gs://your-bucket

Another option would be the Storage Transfer Service. I haven't personally used that service yet, but it seems like a good fit for this use case.

countunique