
I'm implementing my first pipeline for "automated" data ingestion at my company. Our client doesn't want to let us make any calls against their database (not even to create a replica). The best solution I have come up with so far is an endpoint (let them push the data to a storage location), so we can consume it and carry out the rest of the data science process. My cloud provider is Google Cloud and my client uses MySQL Server.

I have been reading many topics on the web and found the following links:

Google Cloud Data Lifecycle - For batch processing, it talks a bit about Cloud Storage and Transfer Appliance

Signed URLs - These are time-limited URLs that grant access to a resource, for example to write data into Google Cloud Storage.
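
For reference, this is roughly how I picture the signed URL part working: a minimal sketch using the google-cloud-storage Python client, where the bucket and object names are placeholders and the service account running it must be able to sign URLs.

    # Minimal sketch: generate a V4 signed URL that lets the client upload
    # (HTTP PUT) one object into a landing bucket. Names are placeholders.
    from datetime import timedelta

    from google.cloud import storage


    def make_upload_url(bucket_name: str, blob_name: str) -> str:
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        # The URL expires after 1 hour and only allows a PUT of this object.
        return blob.generate_signed_url(
            version="v4",
            expiration=timedelta(hours=1),
            method="PUT",
            content_type="application/octet-stream",
        )


    if __name__ == "__main__":
        print(make_upload_url("client-ingest-bucket", "dumps/2024-01-01.json"))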

My simple solution is to use Signed URLs -> Cloud Storage -> Dataflow -> BigQuery. Is it a good approach?

To sum up, I am looking for recommendations about best practices and possible ways to let the client push data into GCP without exposing their data or my infrastructure.

Constraints:

  • Client will send data periodically (once-a-day ingestion)
  • Data is semi-structured (I will create an internal pipeline to make transformations)
  • After preprocessing, the data must be sent to BigQuery
Eduardo Humberto

2 Answers


Signed URLs and Dataflow may not be necessary here. Signed URLs are generally used when you don't want users to need a Google account to access Cloud Storage, but they also come with extra considerations when dealing with resumable uploads. If you know your client will have a Google account when pushing the data, they can be skipped (especially since the timeouts that protect private keys are not needed when the code runs in a backend rather than, say, a client's mobile app). You could simply create a basic web app with App Engine which the client would use to perform the daily push, and which would then upload the file to the Cloud Storage bucket with a resumable upload. App Engine would also make sure the files are in a proper format and follow the specific constraints you define before uploading them.
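
As a rough illustration of that App Engine piece, here is a minimal Flask sketch; the bucket name, the /upload route and the allowed extensions are assumptions, and real validation would depend on the dump format you agree on with the client.

    # Minimal sketch of the App Engine idea: accept the daily push, run a
    # basic sanity check, then stream the file to the landing bucket.
    from flask import Flask, request, abort
    from google.cloud import storage

    app = Flask(__name__)
    BUCKET_NAME = "client-ingest-bucket"    # placeholder landing bucket
    ALLOWED_EXTENSIONS = (".csv", ".json")  # placeholder format constraint


    @app.route("/upload", methods=["POST"])
    def upload():
        uploaded = request.files.get("file")
        if uploaded is None:
            abort(400, "no file provided")
        if not uploaded.filename.endswith(ALLOWED_EXTENSIONS):
            abort(400, "unsupported file type")

        blob = storage.Client().bucket(BUCKET_NAME).blob(
            f"incoming/{uploaded.filename}")
        # upload_from_file switches to a resumable upload for larger payloads.
        blob.upload_from_file(uploaded.stream, content_type=uploaded.content_type)
        return "ok", 200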

As for Dataflow, its best use is for streaming, and in your case this is a passive batch ingestion; paying for a service that is constantly running when you only need the transform to happen once a day may not be the best approach. A more efficient option would be to use Cloud Functions to pre-process and apply the transforms, triggered by an object change notification on the Cloud Storage bucket. The function would then push the data to BigQuery using its API.
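
A minimal sketch of such a Cloud Function could look like the following, assuming the transform boils down to a BigQuery load job over newline-delimited JSON; the dataset and table names are placeholders, and a real function would apply your own transforms first.

    # Minimal sketch of a Cloud Function triggered by a new object in the
    # landing bucket, loading it into BigQuery.
    from google.cloud import bigquery

    TABLE_ID = "my-project.ingest.daily_dump"  # placeholder table


    def on_object_finalize(event, context):
        """Background function: 'event' describes the new Cloud Storage object."""
        uri = f"gs://{event['bucket']}/{event['name']}"

        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            autodetect=True,  # let BigQuery infer the schema
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        # Load straight from Cloud Storage; downloading the object into the
        # function is only needed if the transform requires it.
        load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
        load_job.result()  # wait for the load job to finish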

The complete flow would be: App Engine web app sanitizes the dump -> Storage API -> bucket object change notification -> triggers a Cloud Function (CF) -> CF downloads the object -> CF performs the transform -> CF saves the rows to BigQuery

GAE -> GCS -> CF -> BQ

PYB

In my opinion, gsutil can do the job of pushing the data to Cloud Storage periodically. Generally, gsutil works well for transferring files that are not too big.

I would personally write a cron job that runs a gsutil cp command to push the file from the on-prem system to the Cloud Storage bucket.

Reading from MySQL and writing the file can be done through a simple Spring Boot job.

MySQL -> (write to file) -> file -> gsutil cp -> Cloud Storage
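
If you prefer to keep everything in one script instead of a Spring Boot job plus gsutil, a rough Python sketch of the same flow could look like this; the connection details, query and bucket name are placeholders, and the google-cloud-storage client stands in for the gsutil cp step. The script itself would still be scheduled with cron.

    # Rough sketch: dump a MySQL table to CSV, then push it to the bucket.
    import csv
    from datetime import date

    import pymysql
    from google.cloud import storage


    def export_and_upload():
        out_path = f"/tmp/dump_{date.today()}.csv"

        # Dump the table to a local CSV file (placeholder credentials/query).
        conn = pymysql.connect(host="localhost", user="etl",
                               password="change-me", database="sales")
        try:
            with conn.cursor() as cur, open(out_path, "w", newline="") as f:
                cur.execute("SELECT * FROM orders")
                writer = csv.writer(f)
                writer.writerow(col[0] for col in cur.description)  # header row
                writer.writerows(cur.fetchall())
        finally:
            conn.close()

        # Push the file to the landing bucket (what gsutil cp would do).
        bucket = storage.Client().bucket("client-ingest-bucket")
        bucket.blob(f"incoming/{date.today()}.csv").upload_from_filename(out_path)


    if __name__ == "__main__":
        export_and_upload()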

Suraj Mishra