0

In the book Data Engineering with Google Cloud Platform by Adi Wijaya, to load the data from a sql database to BigQuery, the author always load the data from sql to Google Cloud Storage first, and use it as staging environment, and only after that would he load data to BigQuery

What are the advantage of going through the GCS step and not straight away into BigQuery? In which case would you load directly data from SQL db to BigQuery?

le Minh Nguyen
  • 241
  • 1
  • 10
  • 3
    It depends on what you are trying to achieve. Exporting to Cloud Storage first is usually faster and ensures a consistent copy that is harder to accomplish than exporting directly to BigQuery - think about failure conditions. There is also a cost and speed difference based upon the method used to import into BigQuery. Personally, I usually export to Cloud Storage and then import to BigQuery. This provides a point-in-time backup and releases the workload on the database server sooner. Your question needs more context to recommend the best strategy. – John Hanley Aug 16 '22 at 22:49
  • Another alternative if for some reason you can't follow the storage approach is to spin a read replica and use it for the export, specially if you are going to perform this exports too frequently, i.e. every few seconds or in near real time – Chris32 Aug 17 '22 at 06:17
  • I see, I understand the part about data consistency, but I am not sure how does going through GCS help with speed? As we go through an extra step – le Minh Nguyen Aug 17 '22 at 07:47
  • 1
    Do not assume that an extra step means slower. In the database world, sequential reads and writes can be many times faster. So much faster that an export followed by an import can be faster than another service just reading the database. Transaction, indexes, etc affect performance. I am oversimplifying but the best answer requires details not present in your question. – John Hanley Aug 17 '22 at 08:16

1 Answers1

4

BigQuery doesn't support the SQL format as mentioned in this post to directly load data from Cloud SQL to BigQuery. You can follow the below procedures:

  1. You can use BigQuery Cloud SQL federated query importing data directly into BigQuery from Cloud SQL.
  2. Based on this documentation, you should first generate CSV or JSON from the Cloud SQL Database and persist those files to Cloud Storage and load data into BigQuery.

The advantages when loading data from Cloud SQL to Cloud Storage to BigQuery are:

  • Cloud storage provides services like resumable uploads, whereas combining the job and data means you'd need to be more careful about managing any issues with jobs, and concerning yourself with transient issues.
  • According to this documentation, using Cloud Storage you can take advantage of long term storage:

When you load data into BigQuery from Cloud Storage, you are not charged for the load operation, but you do incur charges for storing the data in Cloud Storage.

  • And as mentioned by @John Hanley, I agree that the advantage of loading data to Google Cloud storage to BigQuery it is faster and you can ensure a consistent copy or backup to be recovered in the event of a primary data failure.
  • BigQuery table can be deleted when not in use and imported when needed. And less likely to fail when creating a table.

Additional information, the cost of storing in BigQuery is higher than in Cloud storage. And you are subject to the following limitations when you load data into BigQuery from a Cloud Storage bucket.

To suggest the best strategy, your question needs more information. Still it depends on your use case. And for more information on loading data can be found in the BigQuery documentation.

Sarah Remo
  • 614
  • 1
  • 9