
What is the best way to transfer all records from a BigQuery table to a Cloud SQL table on a daily basis (the expected record count is more than 255,801,312 [255 million] per day)? I know we can create Dataflow pipelines from BigQuery to Cloud SQL, but with this amount of data the pipeline runs for hours. Is there a better solution to implement within Google Cloud?

Srinivas B
  • Did you try exporting the data as CSV from BigQuery and importing it into Cloud SQL? It's the most efficient way. – guillaume blaquiere Nov 10 '21 at 21:37
  • Exporting data to CSV and importing it into Cloud SQL is a manual process; I'm not sure we can follow this approach for daily maintenance that moves 255 million records. – Srinivas B Nov 11 '21 at 04:14
  • Manual? Or a [workflow](https://cloud.google.com/workflows) ;) – guillaume blaquiere Nov 11 '21 at 14:11
  • @guillaumeblaquiere, I saw your post: https://medium.com/google-cloud/merge-clean-transform-your-csvs-easily-with-bigquery-3a73c0c26d57 EXPORT DATA OPTIONS might be the right choice, but please guide me on the best way to run this through Workflows (similar to Dataflow jobs) so it can also import the files from a Google Storage bucket into Cloud SQL. – Srinivas B Nov 11 '21 at 21:41
  • Workflows orchestrates API calls. First perform a call to BigQuery to export the data to a bucket, then call the Cloud SQL API to import the files. – guillaume blaquiere Nov 11 '21 at 22:31
  • I built a working example, though not at the size of your dataset, so I can't tell you whether it will be more efficient. I will write an article on it. Your feedback will be welcome (I can mention you if you have a Medium account). – guillaume blaquiere Nov 12 '21 at 21:48
  • @guillaumeblaquiere This is an absolutely great answer and I would like to take it to a production implementation. As you mentioned, please share your article soon; it might help with additional information, perhaps including best practices for production as well. I am following you on Medium. – Srinivas B Nov 14 '21 at 22:18
  • Here is the article: https://medium.com/google-cloud/replicate-data-from-bigquery-to-cloud-sql-2b23a08c52b1 – guillaume blaquiere Nov 23 '21 at 08:54

1 Answer


Here is a working example of a Workflow. You need to give enough permissions to your workflow's service account (Cloud SQL Admin, BigQuery Data Viewer + Job User, Cloud Storage Admin), and the table must exist in your Cloud SQL instance (I tested with MySQL).
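Granting those roles with gcloud could look like this (a sketch only; PROJECT_ID and WORKFLOW_SA_EMAIL are placeholders for your project and the workflow's service account):

# Placeholders: replace PROJECT_ID and WORKFLOW_SA_EMAIL with your own values
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:WORKFLOW_SA_EMAIL" --role="roles/cloudsql.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:WORKFLOW_SA_EMAIL" --role="roles/bigquery.dataViewer"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:WORKFLOW_SA_EMAIL" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:WORKFLOW_SA_EMAIL" --role="roles/storage.admin"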

A more detailed article is in the works. Replace the bucket, the project ID, the Cloud SQL instance name (mysql in my case), the query, the table name, and the database schema.
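The target table used below (database test_schema, table workflowimport, matching the id and email columns selected in the export query) has to be created up front. A minimal MySQL sketch; the column types are assumptions, adjust them to your data:

CREATE DATABASE IF NOT EXISTS test_schema;

CREATE TABLE IF NOT EXISTS test_schema.workflowimport (
  id BIGINT,          -- assumed type for the exported id column
  email VARCHAR(255)  -- assumed type for the exported email column
);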

main:
  steps:
    - assignStep:
        assign:
          - bucket: "TODO"
          - projectid: "TODO"
          - prefix: "workflow-import/export"
          - listResult:
              nextPageToken: ""
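    # Export the result of the query as CSV files into the Cloud Storage bucket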
    - export-query:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${projectid}
          body:
            query: ${"EXPORT DATA OPTIONS( uri='gs://" + bucket + "/" + prefix + "*.csv', format='CSV', overwrite=true,header=false) AS SELECT id, email FROM `copy_dataset.name_test`"}
            useLegacySql: false
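    # Import the exported CSV files into Cloud SQL, one page of the Storage listing at a time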
    - importfiles:
        call: import_files
        args:
          pagetoken: ${listResult.nextPageToken}
          bucket: ${bucket}
          prefix: ${prefix}
          projectid: ${projectid}
        result: listResult
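    # If the Storage listing has more pages, run importfiles again with the next page token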
    - missing-files:
        switch:
          - condition:  ${"nextPageToken" in listResult}
            next: importfiles


import_files:
  params:
    - pagetoken
    - bucket
    - prefix
    - projectid
  steps:
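    # List the exported CSV objects under the prefix (one page per call)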
    - list-files:
        call: googleapis.storage.v1.objects.list
        args:
          bucket: ${bucket}
          pageToken: ${pagetoken}
          prefix: ${prefix}
        result: listResult
    - process-files:
        for:
          value: file
          in: ${listResult.items}
          steps:
            - wait-import:
                call: load_file
                args:
                  projectid: ${projectid}
                  importrequest:
                    importContext:
                      uri: ${"gs://" + bucket + "/" + file.name}
                      database: "test_schema"
                      fileType: CSV
                      csvImportOptions:
                        table: "workflowimport"
    - return-step:
        return: ${listResult}


load_file:
  params: [importrequest,projectid]
  steps:
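    # Call the Cloud SQL Admin API import and poll the operation until it is DONE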
    - callImport:
        call: http.post
        args:
          url: ${"https://sqladmin.googleapis.com/v1/projects/" + projectid + "/instances/mysql/import"}
          auth:
            type: OAuth2
          body: ${importrequest}
        result: operation
    - chekoperation:
        switch:
          - condition: ${operation.body.status != "DONE"}
            next: wait
        next: completed
    - completed:
        return: "done"
    - wait:
        call: sys.sleep
        args:
          seconds: 5
        next: getoperation
    - getoperation:
        call: http.get
        args:
          url: ${operation.body.selfLink}
          auth:
            type: OAuth2
        result: operation
        next: chekoperation
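To cover the daily schedule from the question, the workflow can be deployed once and then triggered by Cloud Scheduler. A minimal sketch with gcloud; the workflow name, location, source file, cron expression and service account emails are placeholders, not values from this example:

# Deploy the workflow with the service account that holds the roles listed above
gcloud workflows deploy bq-to-cloudsql \
    --source=workflow.yaml \
    --location=us-central1 \
    --service-account=WORKFLOW_SA_EMAIL

# Start an execution every day at 02:00
# (the scheduler's service account needs roles/workflows.invoker)
gcloud scheduler jobs create http bq-to-cloudsql-daily \
    --location=us-central1 \
    --schedule="0 2 * * *" \
    --uri="https://workflowexecutions.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/workflows/bq-to-cloudsql/executions" \
    --oauth-service-account-email=SCHEDULER_SA_EMAIL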

More details in my Medium article: https://medium.com/google-cloud/replicate-data-from-bigquery-to-cloud-sql-2b23a08c52b1

guillaume blaquiere