
I am using Cloud Composer to orchestrate ETL for files arriving in GCS and headed to BigQuery. I have a Cloud Function that triggers the DAG when a file arrives and passes the file name/location to the DAG. In my DAG I have two tasks:

  1. Use the DataflowPythonOperator to run a Dataflow job that reads the data from text files in GCS, transforms it, and loads it into BQ.
  2. Move the file to a failure/success bucket depending on whether the job fails or succeeds.

Each file has a file ID, which is a column in the BigQuery table. Sometimes a file will be edited once or twice (it's not a streaming scenario where this happens often), and I want to be able to delete the existing records for that file first.

I looked into the other Airflow operators, but I wanted to have two tasks in my DAG before I run the Dataflow job:

  1. Get the file ID based on the file name (right now I have a BigQuery table mapping file name -> file ID, but I could also just bring in a JSON file that serves as the map if that's easier).
  2. If the file ID is already present in the BigQuery table (the table that holds the transformed output of the Dataflow job), delete those records, then run the Dataflow job so I have the most up-to-date information. I know one option is to just add a timestamp and only use the most recent records, but because there could be a million records per file, and I am only replacing maybe 1-2 files a day (not 100), that seems like it could be messy and confusing.

After the Dataflow job, and ideally before moving the file to the success/failure folder, I would like to append to some "records" table saying that this file was loaded at this time. This will be my way to see all the inserts that occurred. I have tried to look for different ways to do this, but I am new to Cloud Composer, and after 10+ hours of research I don't have a clear-cut idea of how this would work, or else I would post code for input.

Thanks, I am very grateful for everyone's help, and I apologize if this is not as clear as you would like. The documentation on Airflow is very robust, but since Cloud Composer and BigQuery are relatively new, it is difficult to learn as thoroughly how to do some GCP-specific tasks.

WIT

1 Answer


It sounds a bit complicated, but luckily there are operators for pretty much every GCP service. Another question is when to trigger the DAG. Have you figured that out? You'd want a Google Cloud Function to run every time a new file comes into that GCS bucket.

  1. Triggering your DAG

To trigger the DAG, you'll want to call it using a Google Cloud Function that relies on Object Finalize or Metadata Update triggers.
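
A minimal sketch of such a function (a Python background function fired by the `google.storage.object.finalize` event), calling the Airflow 1.x experimental REST API. `AIRFLOW_WEB_URL` and `DAG_ID` are placeholders, `requests` would need to be in the function's requirements.txt, and the IAP authentication that a Composer webserver normally requires is omitted here:

```python
# Background Cloud Function: forwards the finalized file's bucket/name to the
# DAG as dag_run.conf. Placeholders and omitted auth are noted above.
import requests

AIRFLOW_WEB_URL = "https://YOUR-COMPOSER-WEBSERVER.appspot.com"  # placeholder
DAG_ID = "gcs_to_bq_etl"                                         # placeholder

def trigger_dag(event, context):
    """Entry point: `event` carries the metadata of the finalized GCS object."""
    payload = {
        # Depending on the Airflow version, conf may need to be passed as a
        # JSON-encoded string instead of a dict.
        "conf": {
            "bucket": event["bucket"],   # e.g. "my-landing-bucket"
            "name": event["name"],       # e.g. "incoming/file_123.csv"
        }
    }
    # Airflow 1.x experimental REST API endpoint that creates a DAG run.
    resp = requests.post(
        "{}/api/experimental/dags/{}/dag_runs".format(AIRFLOW_WEB_URL, DAG_ID),
        json=payload,
    )
    resp.raise_for_status()
```

Deployed with `--trigger-event google.storage.object.finalize` (or the metadata-update event), it fires once per finished upload and hands the file location to the DAG via `dag_run.conf`.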

  2. Loading data to BigQuery

If your file is already in GCS, and in JSON or CSV format, then using a Dataflow job is overkill. You can use the GoogleCloudStorageToBigQueryOperator to load the file to BQ.
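
A hedged sketch of that load step, assuming a CSV file with a header row and an existing destination table; the dataset/table names are placeholders, and the bucket/object come from the conf the Cloud Function passed in. The `dag` object defined here is reused by the later sketches.

```python
# DAG with no schedule: it only runs when the Cloud Function triggers it.
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.utils.dates import days_ago

dag = DAG(
    dag_id="gcs_to_bq_etl",   # must match the ID the Cloud Function triggers
    schedule_interval=None,   # externally triggered only
    start_date=days_ago(1),
)

load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="{{ dag_run.conf['bucket'] }}",
    source_objects=["{{ dag_run.conf['name'] }}"],
    destination_project_dataset_table="my_project.my_dataset.my_table",  # placeholder
    source_format="CSV",
    skip_leading_rows=1,
    write_disposition="WRITE_APPEND",
    dag=dag,
)
```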

  3. Keeping track of the file ID

The best way to compute the file ID is likely a Bash or Python operator in Airflow. Are you able to derive it directly from the file name?

If so, you can have a Python operator upstream of a GoogleCloudStorageObjectSensor that checks whether the file is in the successful directory.

If it is, then you can use the BigQueryOperator to run a delete query on BQ.

After that, you run the GoogleCloudStorageToBigQueryOperator.
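
A rough sketch of that chain, reusing `dag` and `load_to_bq` from the sketch above. The file-ID computation here is placeholder logic (it just strips the extension); in practice it would look the ID up in the filename-to-ID mapping table. Task and table names are assumptions.

```python
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

def compute_file_id(**context):
    file_name = context["dag_run"].conf["name"]
    return file_name.rsplit(".", 1)[0]   # returned value is pushed to XCom

get_file_id = PythonOperator(
    task_id="get_file_id",
    python_callable=compute_file_id,
    provide_context=True,
    dag=dag,
)

# `sql` is a templated field, so the file ID can be pulled from XCom at runtime.
delete_existing = BigQueryOperator(
    task_id="delete_existing_rows",
    sql=(
        "DELETE FROM `my_project.my_dataset.my_table` "
        "WHERE file_id = '{{ ti.xcom_pull(task_ids='get_file_id') }}'"
    ),
    use_legacy_sql=False,
    dag=dag,
)

get_file_id >> delete_existing >> load_to_bq
```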

  4. Moving files around

If you are moving files between GCS locations, the GoogleCloudStorageToGoogleCloudStorageOperator should do the trick. If your BQ load operator fails, move the file to the failed-files location; if it succeeds, move it to the successful-jobs location.
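
One possible shape for those two branches, reusing `dag` and `load_to_bq` from the earlier sketches and using trigger rules so that only one of the two move tasks actually runs. Bucket names are placeholders, and `move_object=True` deletes the source object after the copy.

```python
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
from airflow.utils.trigger_rule import TriggerRule

move_on_success = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id="move_to_success_bucket",
    source_bucket="{{ dag_run.conf['bucket'] }}",
    source_object="{{ dag_run.conf['name'] }}",
    destination_bucket="my-success-bucket",   # placeholder
    move_object=True,
    trigger_rule=TriggerRule.ALL_SUCCESS,     # runs only if the load succeeded
    dag=dag,
)

move_on_failure = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id="move_to_failure_bucket",
    source_bucket="{{ dag_run.conf['bucket'] }}",
    source_object="{{ dag_run.conf['name'] }}",
    destination_bucket="my-failure-bucket",   # placeholder
    move_object=True,
    trigger_rule=TriggerRule.ONE_FAILED,      # runs only if an upstream task failed
    dag=dag,
)

load_to_bq >> [move_on_success, move_on_failure]
```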

  5. Logging task logs

Perhaps all you need in order to keep track of inserts is to log task information to GCS. Check out how to log task information to GCS.
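
If plain task logs turn out not to be enough and you still want the "records" table from the question, one alternative (not part of the suggestion above, just a sketch) is a final BigQueryOperator that appends an audit row after the load. It reuses `dag`, `load_to_bq`, and the `get_file_id` task from the earlier sketches; `my_dataset.load_records` is a hypothetical table that would have to exist already.

```python
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

record_load = BigQueryOperator(
    task_id="record_load",
    sql=(
        "INSERT INTO `my_project.my_dataset.load_records` (file_id, loaded_at) "
        "VALUES ('{{ ti.xcom_pull(task_ids='get_file_id') }}', CURRENT_TIMESTAMP())"
    ),
    use_legacy_sql=False,
    dag=dag,
)

load_to_bq >> record_load
```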

Does that help?

Pablo
  • So I have already done the Cloud Function step and the Dataflow job (Dataflow because it is 1 million records, and Airflow was not as quick, unless I am wrong). The thing is I need to be able to access the file ID once I calculate it. I can calculate it by using the filename and doing a simple lookup in a BigQuery table. After I calculate it, though, I need my template to know what it is so I can use it in my query. Does that make sense? – WIT Jan 14 '19 at 18:31
  • Airflow does not import the file itself. It triggers a BQ load job for the file, which is very fast and free (unlike a Dataflow job). As for the file ID, let me get back to you in a bit. – Pablo Jan 14 '19 at 18:45
  • Are you able to do: (1) import data to a temporary table, (2) wait 1 hour for file updates, (3) if the file updates, delete the temporary table, (4) if the file does not update, copy the temporary table to the destination table? – Pablo Jan 14 '19 at 22:46
  • Unfortunately no, and oftentimes updates will not be within 1 hour; they could be 24 hours later, a month later, or even a year later. – WIT Jan 15 '19 at 02:42
  • In that case, yes - you should have (1) a Python operator that computes the file ID, (2) a BigQueryOperator that runs a delete query for all rows with that file ID, and (3) a DataflowPythonOperator that runs a job that adds the file ID to the rows and inserts them into BigQuery. How about that? – Pablo Jan 15 '19 at 18:24
  • As for storing the file ID, I think you'll be better off storing it in GCS instead of BQ. GCS should be cheaper. – Pablo Jan 15 '19 at 19:23
  • I think that makes sense. The reason I want to look in BQ for the file ID is that there is a rare chance it could change, and I don't want to have to worry about updating it on GCS; I'd rather have it in BQ. – WIT Jan 16 '19 at 01:07
  • I am a little confused about how I would use the Python operator and then get the value itself to use for the next task, though. – WIT Jan 16 '19 at 01:07
  • You can use XComs to pass values between tasks: http://airflow.apache.org/concepts.html#xcoms – Pablo Jan 16 '19 at 04:48
  • Did that help? Happy to continue looking into it. – Pablo Jan 17 '19 at 01:03