I need to organize a large amount of historical data into date partitions in Google BigQuery. It will partition by load date for you (current date only), but that doesn't really help with historical data. The only solutions I have seen so far are to do this manually for each date using date flags until Google builds out the tool a little more. Any solutions?
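By "date flags" I mean BigQuery's partition decorators, where a load job can target a single day's partition by suffixing the table name with $YYYYMMDD. Roughly like this (placeholder dataset/table/bucket names, and assuming the table already exists with a schema; passing a list avoids the shell trying to expand the '$'):
from subprocess import call
#'mydataset.mytable$20161024' targets just the 2016-10-24 partition
call(["bq", "load", "--skip_leading_rows=1",
      "mydataset.mytable$20161024",
      "gs://mybucket/path/file_20161024.csv"])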
-
I recommend you "fix" your post by splitting out the actual question (and posting it as a question) and then (as it looks like you already have a great answer) posting your own answer as an answer (not as part of the question). You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). – Mikhail Berlyant Jan 26 '17 at 21:48
-
I would echo that this seems like a useful post. As an example of a self-answered question, see http://stackoverflow.com/questions/41638651/how-do-i-update-a-nested-record-in-bigquery-using-dml-syntax. – Elliott Brossard Jan 26 '17 at 23:14
-
thanks for the suggestions :) – Ralph Feb 01 '17 at 17:46
1 Answer
I created my own pipeline and have included it below. To run it, put all the code blocks from this post together.
import datetime
from subprocess import call
#date range to backfill; by default run from the start date through today
start = datetime.date(year=2016, month=10, day=24)
#end = datetime.date(year=2016, month=10, day=1)
end = datetime.date.today()
#DoubleClick file types to load and the target BigQuery dataset
file_type = ['activity', 'click', 'impression', 'rich_media']
dataset = 'dataset_name'
The script below copies files from one GCS bucket to another. Google DoubleClick files have the creation date in the file name, and the pipeline uses that date to decide which date partition each file belongs in.
#pull all the files into our own buckets on GCS so that we don't lose data after 3 months
call("gsutil -m cp -r gs://bucket/path/to/mysource* gs://mybucket/path/to/mydata", shell=True)
This seemed like the fastest way to distribute the historical data. Ideally I would open up the individual files and place each row into the correct partition instead, but I don't know how to do that.
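A row-level version could look roughly like the sketch below: read each file, group rows by a date column, then load each group into the matching partition. The column name and CSV layout here are assumptions, not the real DoubleClick schema:
import csv
from collections import defaultdict
def split_rows_by_date(local_csv_path, date_column='Event Date'):
    #group rows by the value of a (placeholder) date column
    rows_by_date = defaultdict(list)
    with open(local_csv_path) as f:
        for row in csv.DictReader(f):
            #normalise '2016-10-24' to '20161024' to match the partition decorator
            partition = row[date_column].replace('-', '')
            rows_by_date[partition].append(row)
    return rows_by_date
Each group would then be written back out and loaded with the same table$YYYYMMDD decorator used in the loop below.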
#create a list of dates based on the start and end dates supplied by the user at the beginning of the script
def daterange(start_date, end_date):
    if start_date <= end_date:
        for n in range((end_date - start_date).days + 1):
            yield start_date + datetime.timedelta(n)
    else:
        for n in range((start_date - end_date).days + 1):
            yield start_date - datetime.timedelta(n)
I added a try/except for error handling at the bottom, but I don't think it actually does anything: the call itself never raises an error, and if the table name is messed up it just creates a job error on the server side without stopping the process or hurting anything.
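If you want the except branch to actually catch something on the client side, one option is subprocess.check_call, which raises CalledProcessError on a non-zero exit code. This is just a sketch, not the original pipeline; server-side job errors still have to be checked separately (e.g. bq ls -j lists recent jobs and their states):
from subprocess import check_call, CalledProcessError
def load_partition(table_ref, source_uri):
    #sketch only: surface client-side bq failures via the exit code;
    #single quotes keep the shell from expanding the '$' in the decorator
    cmd = ("bq --nosync load --skip_leading_rows=1 --replace '"
           + table_ref + "' " + source_uri)
    try:
        check_call(cmd, shell=True)
    except CalledProcessError as e:
        print('bq exited with code ' + str(e.returncode) + ' for ' + table_ref)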
Using the --nosync flag lets me use call for asynchronous jobs. Originally I was using Popen, but I don't think Popen cleans up after itself (I believe call does?), so this seemed like the better option.
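If you would rather keep Popen, the cleanup concern can be handled by holding on to the handles and calling wait() on each one, which reaps the child processes. A rough sketch with placeholder commands, not part of the pipeline:
from subprocess import Popen
commands = ["bq version", "gsutil version"]   #placeholder commands
jobs = [(cmd, Popen(cmd, shell=True)) for cmd in commands]
#wait() reaps each child process so nothing is left behind as a zombie;
#the return code also shows which commands failed on the client side
for cmd, job in jobs:
    if job.wait() != 0:
        print('command failed: ' + cmd)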
#creates a batch of load jobs to move the GCS files into GBQ partitioned tables
for bucket in file_type:
    for date in daterange(start, end):
        date = str(date).replace('-', '')
        #the exact wildcard depends on your file naming convention; it should
        #match only this file type's files for this date
        source = 'gs://mybucket/path/to/mydata/' + bucket + '*' + date + '*'
        table = bucket + '$' + date
        try:
            #single quotes around the table reference keep the shell from
            #expanding the '$' in the partition decorator
            process = call("bq --nosync load --skip_leading_rows=1 --replace '"\
                + dataset + '.' + table + "' " + source, shell=True)
        except:
            print('missing ' + bucket + ' data for ' + date)
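Once the loads finish, you can sanity-check how the rows were distributed by querying the _PARTITIONTIME pseudo column of any of the tables, e.g. the activity table (using the placeholder dataset name from above):
#count rows per partition to confirm the backfill landed where expected
call("bq query --use_legacy_sql=false "
     "'SELECT _PARTITIONTIME AS pt, COUNT(*) AS row_count "
     "FROM " + dataset + ".activity GROUP BY pt ORDER BY pt'", shell=True)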
