I am working on a large-scale batch data pipeline for an ML application and have some questions.
Data Ingestion Layer: Most sources I have read so far suggest Kafka for pulling data from the original source (a CSV file in my case). But when I read about Kafka itself, it is described mainly as a tool for stream processing. Why should I use Kafka for batch processing, is this really best practice, and what are the alternatives?
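For context, the alternative I keep coming back to is just reading the file directly with Spark in the batch job, with no message broker in between. A minimal sketch of what I mean (paths and options are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

# Batch ingestion without Kafka: read the CSV directly into a DataFrame.
# The path and read options below are hypothetical.
spark = SparkSession.builder.appName("batch-ingest").getOrCreate()

raw = (
    spark.read
    .option("header", True)          # first row holds column names
    .option("inferSchema", True)     # let Spark guess column types
    .csv("/data/landing/input.csv")  # hypothetical landing-zone path
)
raw.printSchema()
```

Is something this simple naive for a large-scale pipeline, or is Kafka only worth it once the source is actually a stream?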
Data Storage: I am also quite undecided about data storage for the batch-processing pipeline. Most of the services I have looked at seem oversized for my actual needs, and I am wondering whether there is a leaner solution. For example, my dataset is structured, so MongoDB might be excessive.
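One lean option I am weighing is skipping a database entirely and writing the processed output as partitioned Parquet files on plain file/object storage. A rough, self-contained sketch of what that would look like (column names and paths are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-store").getOrCreate()

# df stands in for the processed/aggregated output; it is re-read from the
# landing zone here only so the sketch is self-contained.
df = (
    spark.read.option("header", True)
    .csv("/data/landing/input.csv")
    .withColumn("ingest_date", F.current_date())  # partition key added just for this example
)

(
    df.write
    .mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("/data/curated/events/")  # hypothetical curated-zone path
)
```

Would that be considered acceptable for structured data, or is a proper warehouse/relational store still the recommended route?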
Further comments: so far I have settled on PySpark for data processing & aggregation, and Airflow & Docker for scheduling & orchestration.
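The orchestration I have in mind is a single daily Airflow DAG that submits the PySpark job. A rough sketch, assuming Airflow 2.x and a placeholder job script:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One daily DAG that runs the batch job via spark-submit.
# The script path is hypothetical.
with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    spark_job = BashOperator(
        task_id="spark_job",
        bash_command="spark-submit /opt/jobs/aggregate.py",  # hypothetical PySpark job script
    )
```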
I still have to decide on the specific services to use and come up with a flow chart for the whole data pipeline.