
I’m trying to automate the task of loading a Postgres database into BigQuery as a daily job. The Postgres database comes as a single SQL dump file containing all the tables and their data, and that file is hosted on Google Cloud Storage (GCS). I want to take that SQL file and load each table from it into its own corresponding table in BigQuery.

Most of the Q&As on Stack Overflow about this assume that the SQL file contains only one table rather than multiple tables. Unfortunately, at the moment, it appears that BigQuery can't load a single file that contains multiple tables. I'm also looking for an automated solution rather than a one-time fix, so I am looking for other approaches.

There are two problems that I’m looking to solve:

  1. How do I split a single SQL dump file containing an entire database into a separate CSV file for each table?
  2. Which tool should I use to automate each step in the data pipeline?

I have some options for each problem, and I'd appreciate feedback on which option seems the most viable, as well as any additional strategies you may have used or heard of.

  1. For splitting the file into separate CSV files, one per table:

    1. Use a script to split the dump into several CSV files, one per table (a rough sketch of this is below, after the list). This seems the least scalable option, especially as the database grows.
    2. Import the SQL file into Cloud SQL, export each table as a separate CSV file to GCS, then load each file into BigQuery. (Sources: Best way to import Google Cloud SQL data into BigQuery; Upload Google Cloud SQL backup to Bigquery)
  2. For the tools to automate each step in the data pipeline (I’m not terribly familiar with them, so please correct me if none of the tools below is the right choice):

    1. Cloud Functions - This service can call various GCP APIs and looks the most promising, but I wanted to confirm it's the right tool before I start down this path. (Automate file upload from Google Cloud Storage to Bigquery)
    2. Cloud Composer - This tool seems to be a bit pricey from what I’ve gathered so far, but if it turns out to be the only viable option, I can look into it.
    3. Dataflow - I’m under the impression that this tool only transforms data in tables that have already been separated, and can't perform import/export jobs, e.g., saving a table from Cloud SQL to GCS as a CSV file.
    4. Are there any other tools that call GCP APIs and can kick off daily jobs that I haven’t listed here?
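
Here is the rough sketch I mentioned for option 1.1. It's untested and assumes the dump is a plain-format pg_dump file, where each table's data appears as a COPY ... FROM stdin; block terminated by a \. line; the file name is a placeholder.

```python
import csv
import re

# Matches the start of a data block in a plain-format pg_dump file, e.g.
# COPY public.users (id, name, email) FROM stdin;
COPY_RE = re.compile(r'^COPY\s+(?:\S+\.)?(\S+)\s*\(([^)]*)\)\s+FROM\s+stdin;')

def split_dump(dump_path):
    """Write one CSV file per table found in the dump."""
    writer, out_file = None, None
    with open(dump_path, encoding='utf-8') as dump:
        for line in dump:
            match = COPY_RE.match(line)
            if match:
                table = match.group(1).strip('"')
                columns = [c.strip().strip('"') for c in match.group(2).split(',')]
                out_file = open(f'{table}.csv', 'w', newline='', encoding='utf-8')
                writer = csv.writer(out_file)
                writer.writerow(columns)              # header row
            elif writer is not None:
                if line.rstrip('\n') == r'\.':        # end of this table's data
                    out_file.close()
                    writer, out_file = None, None
                else:
                    # COPY data is tab-separated; \N marks a NULL value
                    values = line.rstrip('\n').split('\t')
                    writer.writerow('' if v == r'\N' else v for v in values)

if __name__ == '__main__':
    split_dump('database_dump.sql')  # placeholder file name
```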

Right now, I'm thinking of using Cloud Functions to orchestrate the following pipeline: GCS -> Cloud SQL -> split into individual CSV files -> upload to GCS -> load into BigQuery. Any alternative strategies you know of would be much appreciated.
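
Here is a rough, untested sketch of how I picture those steps being scripted, using subprocess calls to the gcloud and bq CLIs; the instance, bucket, database, dataset, and table names are all placeholders.

```python
import subprocess

INSTANCE = 'my-instance'          # Cloud SQL instance (placeholder)
DATABASE = 'my_database'          # Postgres database name (placeholder)
BUCKET = 'gs://my-bucket'         # GCS bucket (placeholder)
DATASET = 'my_dataset'            # BigQuery dataset (placeholder)
TABLES = ['users', 'orders']      # tables to export (placeholder)

def run(cmd):
    print('>', ' '.join(cmd))
    subprocess.run(cmd, check=True)

# 1. Import the SQL dump from GCS into Cloud SQL
run(['gcloud', 'sql', 'import', 'sql', INSTANCE,
     f'{BUCKET}/database_dump.sql', f'--database={DATABASE}', '--quiet'])

for table in TABLES:
    # 2. Export each table from Cloud SQL to its own CSV file in GCS
    run(['gcloud', 'sql', 'export', 'csv', INSTANCE,
         f'{BUCKET}/exports/{table}.csv',
         f'--database={DATABASE}', f'--query=SELECT * FROM {table}'])

    # 3. Load the CSV into its own BigQuery table
    run(['bq', 'load', '--replace', '--autodetect', '--source_format=CSV',
         f'{DATASET}.{table}', f'{BUCKET}/exports/{table}.csv'])
```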

truckbot
  • Did you succeed in finding a way to do it? – charlesdg Sep 30 '19 at 09:24
  • @charlesdg In the end, we decided not to use any of these solutions. We were initially saving the data into GCS, then trying to upload that data into BigQuery. We ended up taking GCS out as a middle step and just wrote the data directly into BigQuery with a script, without using any of Google's tools. – truckbot Oct 01 '19 at 10:17
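
(For anyone who lands here later: a direct load like the one described in the comment above might look roughly like the sketch below, using pandas and the BigQuery Python client. The connection string, dataset, and table names are placeholders, not the actual script.)

```python
import pandas as pd
import sqlalchemy
from google.cloud import bigquery

# Placeholder connection string for the source Postgres database
engine = sqlalchemy.create_engine('postgresql://user:password@host:5432/my_database')
client = bigquery.Client()

for table in ['users', 'orders']:              # placeholder table names
    # Read the whole table into a DataFrame, then load it into BigQuery
    df = pd.read_sql(f'SELECT * FROM {table}', engine)
    job = client.load_table_from_dataframe(df, f'my_dataset.{table}')
    job.result()                               # wait for the load job to finish
```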

0 Answers