I want to write an ETL pipeline in Spark that handles different input sources while using as few computing resources as possible, and I'm having trouble doing this with the 'traditional' Spark ETL approach.
I have a number of streaming datasources which need to be persisted into Delta Lake tables. Each datasource is simply a folder of avro files in s3. Each datasource has a different schema. Each datasource should be persisted into its own Delta Lake table. Little conversion other than avro -> delta is needed, only enrichment with some additional fields derived from the filename. New files arrive at a moderate rate, from once a minute to once a day, depending on the datasource. I get a Kafka notification when new data lands, describing the kind of data and the s3 file path.
Assume there are two datasources, A and B. A's files are s3://bucket/A/*, B's are s3://bucket/B/*. Whenever a new file is added I get a Kafka message with a payload like {"datasource": "A", "filename": "s3://bucket/A/file1", ... other fields}. A's files should go to the Delta table s3://delta/A/, B's to s3://delta/B/.
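To make the message concrete, this is roughly how I'd model the payload in Scala (the field names are just the ones from the example above plus the id field I use later; the real message carries more fields):

// hypothetical model of the notification payload; the real message has more fields
case class FileNotification(datasource: String, filename: String, id: String)

// the example message above as a value (id stands in for one of the "other fields"):
val n = FileNotification(datasource = "A", filename = "s3://bucket/A/file1", id = "...")
// and the Delta table derived from it:
val dest = s"s3://delta/${n.datasource}"   // -> s3://delta/A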
How can I ingest them all in a single Spark application with minimal latency? Since new data is constantly arriving, this sounds like a job for streaming. But in Spark streaming one has to define the stream schema upfront, and I have different sources with different schemas that are not known upfront.
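For illustration, here is roughly what one 'traditional' per-source stream would look like; the schema and checkpoint path below are made up, and having to declare them upfront for every source is exactly the problem:

import org.apache.spark.sql.types._

// hypothetical schema for datasource A; in reality I don't know it upfront
val schemaA = StructType(Seq(
  StructField("id", StringType),
  StructField("value", DoubleType)))

val streamA = spark.readStream
  .format("avro")
  .schema(schemaA)          // file-based streaming sources require an explicit schema
  .load("s3://bucket/A/")

streamA.writeStream
  .format("delta")
  .option("checkpointLocation", "s3://delta/_checkpoints/A")  // placeholder checkpoint path
  .start("s3://delta/A/")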
Spinning up a dedicated Spark application per datasource is not an option: there are 100+ datasources, each receiving very small files, and running 100+ Spark applications would be a waste of money. Everything should be ingested by a single cluster of moderate size.
The only idea I have right now: run a normal Kafka consumer in the driver process; for each record, read the file into a dataframe, enrich it with the additional fields and persist it to its Delta table. For more parallelism, consume multiple messages and run them in futures, so multiple jobs run concurrently (sketched after the code below). Some pseudo-code, in the driver process:
import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.functions.lit
import scala.jdk.CollectionConverters._

// plain Kafka consumer in the driver; the value deserializer is assumed to yield the notification payload
val consumer = new KafkaConsumer[String, FileNotification](...)
consumer.subscribe(List("file-notifications").asJava)   // topic name is a placeholder

while (true) {
  for (record <- consumer.poll(Duration.ofSeconds(5)).asScala) {
    val n = record.value()
    val df = spark.read.format("avro").load(n.filename)  // read the newly landed avro file
      .withColumn("id", lit(n.id))                       // enrich with fields from the message
    val dest = s"s3://delta/${n.datasource}"
    df.write.format("delta").mode("append").save(dest)
  }
  // commit the polled offsets only after their files have been written to Delta
  consumer.commitSync()
}
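And roughly how the futures variant mentioned above could look, reusing the same consumer and session (the pool size and timeout are placeholders I'd have to tune):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.functions.lit

// bounded pool: at most 8 ingest jobs in flight at once (8 is a placeholder to tune)
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val records = consumer.poll(java.time.Duration.ofSeconds(5))
val writes = records.asScala.toSeq.map { record =>
  Future {
    val n = record.value()
    spark.read.format("avro").load(n.filename)
      .withColumn("id", lit(n.id))
      .write.format("delta").mode("append")
      .save(s"s3://delta/${n.datasource}")
  }
}
// commit the offsets only after every write in this poll batch has succeeded
Await.result(Future.sequence(writes), 10.minutes)
consumer.commitSync()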
Sounds good (and a PoC shows it works), but I wonder if there are other options? Any other ideas are appreciated. Spark runs on the Databricks platform.