I've been at this for two days and cannot figure it out. This is my first foray into the world of Google Cloud Platform. I would like to do the following two things in the simplest way possible:
Allow me to provide some context before I start. I would like to pull data from the NYC Taxi and Limousine Commission trip record page: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
I am interested in pulling all the months of 2019 (12 CSV files, 10+ GB in total). I would like to bring these 12 CSV files into Google Cloud Storage for later processing in BigQuery / Dataprep. Storing these files locally is not an option because of their size.
First thing - is there a way to bring in these files using wget, but for multiple CSV files? From my research, wget seems better suited to pulling a single zip file or one (or a few) individual files. The data on the taxi website is not zipped; it is available as one CSV file per month. A sketch of what I had in mind follows.
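For what it's worth, this is roughly what I pictured for the wget route, run from Cloud Shell or a Compute Engine VM (gs://my-taxi-data is just a placeholder bucket name, and I don't know whether streaming through gsutil like this is a sensible approach):

for m in $(seq -w 1 12); do
  # stream each month's file straight into the bucket so nothing large sits on local disk
  wget -q -O - "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-${m}.csv" \
    | gsutil cp - "gs://my-taxi-data/yellow_tripdata_2019-${m}.csv"
done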
Second thing - if wget is not a scalable option, I wrote a Python script that pulls the data for those 12 months of 2019 from the URLs behind that page. The script I am using is as follows:
import requests
import shutil
filenames = [ # probably could loop through these .._YYYY-MM
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-02.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-03.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-04.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-05.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-06.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-07.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-08.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-09.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-10.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-11.csv',
'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-12.csv',
]
for x in filenames:
    r = requests.get(x, stream=True)
    if r.status_code == 200:
        # save as e.g. '2019-01.csv' (everything after the last underscore)
        with open(x.split('_')[-1], 'wb') as f:
            r.raw.decode_content = True  # decompress gzip/deflate if the server used it
            shutil.copyfileobj(r.raw, f)  # stream the response body to disk in chunks
How do I use this Python script inside Google Cloud Platform to 'import' these 12 CSV files into Google Cloud Storage for use in BigQuery and Dataprep? I tried the SSH command line and activated conda, but I still couldn't figure it out (even after reviewing the Google Cloud Platform user / reference guides!).
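To be concrete, this is the direction I was guessing at with the Python route: using the google-cloud-storage client library to stream each response straight into a bucket instead of writing it to local disk. The bucket name my-taxi-data is just a placeholder, and I'm not sure whether upload_from_file is the right call for a streamed response:

import requests
from google.cloud import storage

# same 12 URLs as above, just built in a loop
filenames = [
    f'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-{m:02d}.csv'
    for m in range(1, 13)
]

client = storage.Client()               # uses the project's default credentials
bucket = client.bucket('my-taxi-data')  # placeholder bucket name

for url in filenames:
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        r.raw.decode_content = True
        blob = bucket.blob(url.split('/')[-1])  # e.g. 'yellow_tripdata_2019-01.csv'
        # read the file-like HTTP response and upload it without keeping a local copy
        blob.upload_from_file(r.raw)

If that worked, I assume I could then point BigQuery / Dataprep at gs://my-taxi-data/*.csv.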
Perhaps someone may be able to assist me. Thank you for any help you can provide! :)