
I'm currently using Airflow's PostgresToGoogleCloudStorageOperator and GoogleCloudStorageToBigQueryOperator to export every table of my Postgres DB (hosted on AWS RDS) to BigQuery. It works, but I have 75 tables, so Airflow creates 75 * 2 tasks. Since I'm new to Airflow, I don't know if this is good practice.
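
Here is a minimal sketch of the pattern (Airflow 1.10-era contrib imports; connection IDs, bucket and dataset names are placeholders, not my real ones):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import (
    PostgresToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG(
    "postgres_to_bigquery",
    schedule_interval="@daily",
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

table = "my_table"  # this pair of tasks is repeated for each of the 75 tables

# Dump the table to GCS as newline-delimited JSON, plus a BigQuery schema file.
export_to_gcs = PostgresToGoogleCloudStorageOperator(
    task_id="export_{}_to_gcs".format(table),
    sql="SELECT * FROM {};".format(table),
    bucket="my-gcs-bucket",
    filename="{}/export_{{}}.json".format(table),  # {} becomes the chunk index
    schema_filename="{}/schema.json".format(table),
    postgres_conn_id="my_postgres",
    google_cloud_storage_conn_id="my_gcp",
    dag=dag,
)

# Load the exported files into BigQuery using the schema written above.
load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="load_{}_to_bq".format(table),
    bucket="my-gcs-bucket",
    source_objects=["{}/export_*.json".format(table)],
    schema_object="{}/schema.json".format(table),
    destination_project_dataset_table="my_dataset.{}".format(table),
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_TRUNCATE",
    bigquery_conn_id="my_gcp",
    google_cloud_storage_conn_id="my_gcp",
    dag=dag,
)

export_to_gcs >> load_to_bq
```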

In any case, I'd like to find a way to export all tables at once (pg_dump?) to GCS and then import them into BigQuery.

Nicoowr

2 Answers

This is a matter of opinion really, but I think it is fine to have a job with 75 x 2 tasks. It sounds like this job will be a one-off or on an infrequent schedule? Bonus points if you use a YAML file to specify the tables and have your DAG read that file to get the list of tables to export.
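
Something along these lines, as a minimal sketch (the tables.yaml path, connection IDs, bucket and dataset names are placeholders):

```python
import os
from datetime import datetime

import yaml
from airflow import DAG
from airflow.contrib.operators.postgres_to_gcs_operator import (
    PostgresToGoogleCloudStorageOperator,
)
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# tables.yaml lives next to the DAG file and could look like:
#
#   tables:
#     - users
#     - orders
#     - invoices
#
with open(os.path.join(os.path.dirname(__file__), "tables.yaml")) as f:
    tables = yaml.safe_load(f)["tables"]

dag = DAG(
    "postgres_to_bigquery",
    schedule_interval="@daily",
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# One export/load pair per table listed in the YAML file.
for table in tables:
    export = PostgresToGoogleCloudStorageOperator(
        task_id="export_{}".format(table),
        sql="SELECT * FROM {};".format(table),
        bucket="my-gcs-bucket",
        filename="{}/export_{{}}.json".format(table),
        schema_filename="{}/schema.json".format(table),
        postgres_conn_id="my_postgres",
        google_cloud_storage_conn_id="my_gcp",
        dag=dag,
    )
    load = GoogleCloudStorageToBigQueryOperator(
        task_id="load_{}".format(table),
        bucket="my-gcs-bucket",
        source_objects=["{}/export_*.json".format(table)],
        schema_object="{}/schema.json".format(table),
        destination_project_dataset_table="my_dataset.{}".format(table),
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        bigquery_conn_id="my_gcp",
        google_cloud_storage_conn_id="my_gcp",
        dag=dag,
    )
    export >> load
```

Adding a table is then just a one-line change to the YAML file, and the DAG picks it up the next time the scheduler parses it.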

If it were something you ran frequently, then I think it would be better to use a BashOperator and pg_dump as you suggested, since Airflow can take quite a bit of time to schedule tasks after others have completed. You'd have to do some testing with your Airflow parallelism and scheduler interval (and probably some other Airflow config options) to get it running efficiently.
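
As a rough sketch of that alternative (host, user, database and bucket names are placeholders; pg_dump and gsutil are assumed to be installed on the worker, and a raw SQL dump would still need a separate conversion/load step before BigQuery can ingest it):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "pg_dump_to_gcs",
    schedule_interval="@daily",
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# Dump the whole database in one shot and upload the dump file to GCS.
# Credentials are assumed to come from the worker environment (e.g. a .pgpass
# file); the host, database and bucket names are placeholders.
dump_and_upload = BashOperator(
    task_id="pg_dump_to_gcs",
    bash_command=(
        "pg_dump --host=my-rds-host --username=my_user --dbname=my_db "
        "--format=custom --file=/tmp/my_db_{{ ds_nodash }}.dump && "
        "gsutil cp /tmp/my_db_{{ ds_nodash }}.dump gs://my-gcs-bucket/dumps/"
    ),
    dag=dag,
)
```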

Simon D
  • The job will be run once every night. As for the YAML file, I'm currently using a PostgresHook to fetch all the table names. It does the job quite well, but perhaps it's not idiomatic? Thanks for your insight anyway, I'll also consider using a BashOperator – Nicoowr Feb 03 '20 at 17:43
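
For reference, a minimal sketch of the PostgresHook approach mentioned in that comment (the connection ID and schema name are placeholders; note that the query runs every time the scheduler parses the DAG file):

```python
from airflow.hooks.postgres_hook import PostgresHook


def get_table_names():
    """Return the names of all base tables in the public schema."""
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
    )
    return [row[0] for row in rows]


# The result can drive the same for-loop that builds the export/load pairs:
# for table in get_table_names(): ...
```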

In case your Postgres instance is hosted on Google Cloud SQL:

You can try out the new Google Cloud beta feature, Cloud SQL federated queries: you create a connection resource for Postgres in the BigQuery project, and then use that connection inside the EXTERNAL_QUERY function, which returns a BigQuery table reference for the Postgres table you want.

It is a bit slow (data is copied from Postgres to BigQuery on every query), but you can materialize the query results into a table using Airflow (a BigQueryOperator with a destination table set), and then queries against the destination table should perform at least as well as they do in Postgres.
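
For example, a rough sketch of that materialization step, assuming the connection resource already exists (project, region, connection, dataset and table names are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

dag = DAG(
    "materialize_federated_postgres",
    schedule_interval="@daily",
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# Run a federated query against the Cloud SQL Postgres instance and write the
# result into a native BigQuery table. The connection ID, dataset and table
# names below are placeholders.
materialize_my_table = BigQueryOperator(
    task_id="materialize_my_table",
    sql=(
        "SELECT * FROM EXTERNAL_QUERY("
        "'my-project.europe-west1.my_cloudsql_connection', "
        "'SELECT * FROM my_table;')"
    ),
    destination_dataset_table="my_project.my_dataset.my_table",
    write_disposition="WRITE_TRUNCATE",
    use_legacy_sql=False,
    dag=dag,
)
```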

Even if that works for you, you would still need to materialize every table, either manually or with a script that iterates over the Postgres tables.

fpopic
  • Thanks for your suggestion @fpopic, unfortunately my DB is hosted on AWS, which does not provide such a feature. But it might help other people! – Nicoowr Jan 31 '20 at 14:08