I'm implementing my first pipeline for "automated" data ingestion at my company. Our client won't let us make any calls against their database (not even to create a replica, etc.). The best solution I have come up with so far is an endpoint (let them push the data to a storage bucket), so we can consume it and carry out the rest of the data science process. My cloud provider is Google Cloud and my client uses MySQL Server.
I have been reading many topics on the web and found the following links:
Google Cloud Data Lifecycle - For batch processing it talks a bit about Cloud Storage, Storage Transfer Service, and Transfer Appliance
Signed URLs - These are time-limited URLs that grant access to a resource, for example to write an object into Google Cloud Storage.
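To make the signed-URL idea concrete, here is a minimal sketch of what generating one could look like with the `google-cloud-storage` Python library. The bucket name, object prefix, and JSON content type are placeholder assumptions, not requirements:

```python
from datetime import date, timedelta


def daily_object_name(prefix: str, day: date) -> str:
    # Partition uploads by ingestion date so each daily push lands
    # in its own "folder" (e.g. uploads/dt=2024-01-31/data.json).
    return f"{prefix}/dt={day.isoformat()}/data.json"


def make_upload_url(bucket_name: str, object_name: str,
                    valid_for: timedelta = timedelta(hours=1)) -> str:
    # Requires the google-cloud-storage package and application default
    # credentials on *our* side; the client only ever receives the
    # resulting URL, never a service account key.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=valid_for,
        method="PUT",
        content_type="application/json",
    )


if __name__ == "__main__":
    # Hand this URL to the client; they HTTP PUT their daily export to it.
    url = make_upload_url("ingest-bucket", daily_object_name("uploads", date.today()))
    print(url)
```

The client can then upload with a plain HTTP PUT (e.g. curl) and never needs direct access to your project.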
My simple solution is to use Signed URLs -> Cloud Storage -> Dataflow -> BigQuery. Is this a good approach?
To sum up, I am looking for recommendations about best practices and possible ways to let the client insert data into GCP without exposing their data or my infrastructure.
Constraints:
- Client will send data periodically (once a day ingestion)
- Data is semi-structured (I will create an internal pipeline to make the transformations)
- After preprocessing, data must be sent to BigQuery
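For the final constraint, the load into BigQuery could be a plain load job from Cloud Storage rather than a full Dataflow pipeline, assuming the preprocessed output is newline-delimited JSON. This is a hedged sketch using the `google-cloud-bigquery` library; the URI and table ID are hypothetical:

```python
def gcs_uri(bucket: str, object_name: str) -> str:
    # Build the gs:// URI that BigQuery load jobs accept as a source.
    return f"gs://{bucket}/{object_name}"


def load_to_bigquery(source_uri: str, table_id: str) -> None:
    # Requires the google-cloud-bigquery package; table_id looks like
    # "my-project.my_dataset.my_table". Autodetect infers the schema
    # from the semi-structured input, which may need tightening later.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(source_uri, table_id, job_config=job_config).result()


if __name__ == "__main__":
    load_to_bigquery(
        gcs_uri("ingest-bucket", "processed/dt=2024-01-31/data.json"),
        "my-project.ingest.daily_data",
    )
```

Load jobs are free of streaming costs and fit a once-a-day batch cadence; Dataflow would only be worth it if the transformations are heavy enough to need distributed processing.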