I am working on redesigning a data pipeline that imports customer data in CSV format from cloud buckets that the customers own (we already have the connection details) into a transactional database that we own. Constraints:
- We import ~10 million files from ~10 thousand customers every day. Each customer can start multiple imports at any time.
- File sizes vary from small (~10 records) to large (~millions of records).
The current design runs imports as they come in, and we are seeing issues with error handling, retries, and backpressure management. Ideas so far:
- Copy incoming files into a staging blob store that we control and schedule jobs to import the staged data (rough sketch below).
- Use an orchestrator like Airflow with sensors, running a DAG run per import (second sketch below).
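For the staging idea, here is a minimal sketch of what I'm picturing, with an in-memory queue purely for illustration (in practice it would be something durable like SQS or Pub/Sub): a bounded queue of import jobs gives backpressure on the producer side, a fixed worker pool is sized to what the DB can absorb, and each job retries with exponential backoff before going to a dead-letter path. `download_file` and `load_into_db` are placeholders for whatever SDK and DB client we'd actually use.

```python
# Sketch of idea 1: staged files become jobs on a bounded queue feeding a
# fixed worker pool. The bounded queue is the backpressure mechanism.
import queue
import random
import threading
import time

MAX_IN_FLIGHT = 100   # bounded queue = backpressure on the producer side
WORKER_COUNT = 16     # tune to DB write capacity, not to inbound volume
MAX_RETRIES = 5

jobs: "queue.Queue[dict]" = queue.Queue(maxsize=MAX_IN_FLIGHT)

def download_file(job: dict) -> bytes:
    """Placeholder: fetch the staged CSV for this job from our staging bucket."""
    return b""

def load_into_db(payload: bytes) -> None:
    """Placeholder: parse the CSV and write rows to the transactional DB."""
    pass

def import_with_retries(job: dict) -> None:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            load_into_db(download_file(job))
            return
        except Exception:
            if attempt == MAX_RETRIES:
                # park the job for manual inspection / a dead-letter store
                print(f"giving up on {job['file']}")
                return
            # exponential backoff with jitter before retrying
            time.sleep((2 ** attempt) + random.random())

def worker() -> None:
    while True:
        job = jobs.get()
        try:
            import_with_retries(job)
        finally:
            jobs.task_done()

def enqueue_import(customer_id: str, file_name: str) -> None:
    # blocks when the queue is full, which throttles ingestion upstream
    jobs.put({"customer": customer_id, "file": file_name})

if __name__ == "__main__":
    for _ in range(WORKER_COUNT):
        threading.Thread(target=worker, daemon=True).start()
    enqueue_import("cust-123", "orders-2024-01-01.csv")
    jobs.join()
```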
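For the Airflow idea, a rough sketch assuming Airflow 2.x with the Amazon provider's `S3KeySensor`; the connection id, the conf keys, and the `import_csv_file` loader are made up for illustration. Each customer import would be a separately triggered run (e.g. `airflow dags trigger -c '{"bucket": "...", "key": "..."}' customer_csv_import`), with retries delegated to the scheduler and `max_active_runs` acting as a crude concurrency cap.

```python
# Sketch of idea 2: one externally-triggered DAG, each import passed in
# via dag_run.conf. Assumes Airflow 2.x + the Amazon provider.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def import_csv_file(**context):
    """Placeholder: stream the CSV from the customer bucket and load it into the DB."""
    conf = context["dag_run"].conf
    print(f"importing s3://{conf['bucket']}/{conf['key']}")


with DAG(
    dag_id="customer_csv_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # triggered per import, not on a schedule
    catchup=False,
    max_active_runs=64,              # crude backpressure knob at the DAG level
    default_args={
        "retries": 3,                # retries handled by the scheduler
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    # wait until the customer's file is actually visible in their bucket
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="{{ dag_run.conf['bucket'] }}",
        bucket_key="{{ dag_run.conf['key'] }}",
        aws_conn_id="customer_s3",   # assumed connection id
        mode="reschedule",           # free the worker slot while waiting
        poke_interval=60,
        timeout=60 * 60,
    )

    load = PythonOperator(
        task_id="load_into_db",
        python_callable=import_csv_file,
    )

    wait_for_file >> load
```

My concern with this one is whether a DAG run per file is sane at ~10 million files a day, or whether runs should be per customer batch instead.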
Curious to see how the community would implement this.