
I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables, but contains data only for that specific customer.

What I want to do is transform the data in various ways to get an overview table covering all customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases in a Cloud SQL instance, execute queries, and write the results to BigQuery.

I was really hoping that Dataflow would be the solution, but despite everything I have tried and looked up online, I cannot find a way to make it work. Since I have already spent a lot of time investigating Dataflow, I thought it might be best to ask here.

Currently I am looking at Data Fusion, Datastream, and Apache Airflow. Any suggestions?

AleksF

2 Answers


Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
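For illustration, here is a rough sketch of that loop-generated DAG, assuming Airflow 2 with the Google provider installed and a PostgreSQL-flavoured Cloud SQL instance. The database list, connection IDs, bucket, table, and dataset names are all placeholders, not anything from your setup:

```python
# Sketch: one export + load task pair per customer database, generated in a loop.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

CUSTOMER_DATABASES = ["customer_001", "customer_002", "customer_003"]  # hundreds in practice

with DAG(
    "cloudsql_to_bigquery_overview",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for db in CUSTOMER_DATABASES:
        # One Airflow connection per customer database (e.g. via the Cloud SQL
        # Auth Proxy), each pointing at the same instance but a different database.
        export = PostgresToGCSOperator(
            task_id=f"export_{db}",
            postgres_conn_id=f"cloudsql_{db}",
            sql="SELECT *, current_database() AS customer_db FROM orders",
            bucket="my-staging-bucket",
            filename=f"exports/{db}/orders.json",
        )
        load = GCSToBigQueryOperator(
            task_id=f"load_{db}",
            bucket="my-staging-bucket",
            source_objects=[f"exports/{db}/orders.json"],
            destination_project_dataset_table="analytics.orders_all_customers",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_APPEND",
            autodetect=True,
        )
        export >> load
```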

However, I'd be remiss not to ask: should you?

There may be a really excellent reason why you've created hundreds of databases in one instance rather than one database with a customer field on each table. But if security is the concern, a row-level security policy could provide an additional layer of safety without putting you in this difficult situation. Adding an index on the customer field would let you retrieve the appropriate sub-table swiftly (in return for a small cost when inserting new rows), so performance doesn't seem like a reason to do this either.
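As a minimal sketch of what that single-database alternative could look like, assuming a PostgreSQL-flavoured Cloud SQL instance (the `orders` table, `customer_id` column, and connection string below are hypothetical):

```python
# Hypothetical example: add an index and a row-level security policy on a
# shared "orders" table so each session only sees its own customer's rows.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://app_admin:secret@10.0.0.1/appdb")

statements = [
    # Index so per-customer queries stay fast despite the shared table.
    "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)",
    # Turn on row-level security for the table...
    "ALTER TABLE orders ENABLE ROW LEVEL SECURITY",
    # ...and only expose rows matching the customer id set on the session.
    """
    CREATE POLICY orders_per_customer ON orders
    USING (customer_id = current_setting('app.current_customer')::int)
    """,
]

with engine.begin() as conn:
    for stmt in statements:
        conn.execute(text(stmt))
    # Each application session would then run, for example:
    # SET app.current_customer = '42';  -- before querying orders
```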

Given that it would then be pretty straightforward to get your data into BigQuery, I would be moving heaven and earth to switch over to this setup, if I were you!

Paddy Alton
  • Thanks for your input! I am wondering the same, but as a new hire I have no idea. I will explore our options to rebuild it or replicate it somehow, that would be the best :) – AleksF Dec 01 '22 at 16:10

Why doesn't Dataflow fit your needs? You could run a query to find out the tables, and then iteratively build the pipeline's JdbcIO sources/PCollections based on those results. Beam has a Flatten transform that can join PCollections.

What you are trying to do is one of the use cases Dataflow Flex Templates were created for (to allow dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well.
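As a minimal sketch of that idea in the Python SDK: one pipeline, one JdbcIO read per customer database, flattened into a single PCollection and written to BigQuery. The database names, credentials, table, and BigQuery destination are placeholders, and `ReadFromJdbc` is a cross-language transform, so it needs Java available locally or the Dataflow runner to expand it:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

CUSTOMER_DATABASES = ["customer_001", "customer_002", "customer_003"]


def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=options) as p:
        per_db = []
        for db in CUSTOMER_DATABASES:
            rows = p | f"read_{db}" >> ReadFromJdbc(
                table_name="orders",
                driver_class_name="org.postgresql.Driver",
                jdbc_url=f"jdbc:postgresql://10.0.0.1:5432/{db}",
                username="readonly_user",
                password="secret",
            )
            # Rows come back as schema'd NamedTuples; tag each one with the
            # database it came from before merging.
            tagged = rows | f"tag_{db}" >> beam.Map(
                lambda row, db=db: {**row._asdict(), "customer_db": db}
            )
            per_db.append(tagged)

        (
            per_db
            | "flatten" >> beam.Flatten()
            | "to_bq" >> beam.io.WriteToBigQuery(
                "my-project:analytics.orders_all_customers",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                # Destination table assumed to exist; pass schema=... to let Beam create it.
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```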

Bruno Volpato
  • This is something I am not that familiar with, so I want to check if I understood you correctly. Basically, you suggest: creating a template that collects data from a single database, getting a list of all databases, and iterating through that list, correct? If so, I was considering that, but it seemed like a bad idea to create hundreds of different pipelines. I would also need to turn them all into jobs via Dataflow, which would be a huge list and hard to manage, right? Anyway, I will look into the Flex Templates, somehow missed that completely. Thanks! – AleksF Dec 02 '22 at 17:34
  • You got the idea correctly, but you can have multiple JdbcIO sources in the same pipeline, and then flatten them together to send to the same sink. No need to spawn multiple pipelines. – Bruno Volpato Dec 02 '22 at 18:02
  • Ohh, ok, that makes much more sense, thanks for the tip! – AleksF Dec 05 '22 at 07:52
  • I am having a bit of a hard time understanding how to configure the JdbcIO connection. Could you maybe help out with that a little as well? – AleksF Dec 05 '22 at 15:26
  • We have this template as an example https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/src/main/java/com/google/cloud/teleport/templates/JdbcToBigQuery.java#L19. Where it reads from JdbcIO, we could have a loop that adds each resulting `PCollection` to a `PCollectionList`, and then a `collections.apply(Flatten.pCollections());` after that. See https://beam.apache.org/documentation/transforms/java/other/flatten/ for an example. – Bruno Volpato Dec 05 '22 at 17:00
  • Hey Bruno, thanks for all the help and suggestions! Unfortunately, I cannot even get the connection working. I am using the Python SDK io.jdbc module, but it won't work for me. I have described everything in more detail here: https://stackoverflow.com/questions/74732989/cannot-connect-to-cloud-sql-using-apache-beam-jdbc I would greatly appreciate it if you could take a look and give some feedback. It must be something simple and silly that I am missing... – AleksF Dec 08 '22 at 15:57
  • Hey, I am still struggling to create a connection even to a local instance, but I am slowly moving forward. If you have the time for it, I would appreciate it if you took a look at this question: https://stackoverflow.com/questions/74941087/expansion-service-failed-to-build-transform – AleksF Dec 28 '22 at 13:56