
I want to write an ETL pipeline in Spark that handles different input sources while using as few computing resources as possible, and I'm having trouble doing this with the 'traditional' Spark ETL approach.

I have a number of streaming datasources which need to be persisted into Delta Lake tables. Each datasource is simply a folder in S3 with avro files. Each datasource has a different schema, and each should be persisted into its own Delta Lake table. Little conversion other than avro -> delta is needed, only enrichment with some additional fields derived from the filename. New files are added at a moderate rate, from once a minute to once a day, depending on the datasource. I get a Kafka notification when new data lands, describing what kind of data it is and its S3 file path.

Assume there are two datasources, A and B. A is the s3://bucket/A/* files, B is s3://bucket/B/*. Whenever a new file is added I get a Kafka message with payload {'datasource': 'A', 'filename': 's3://bucket/A/file1', ... other fields}. A files should go to the delta table s3://delta/A/, B files to s3://delta/B/.

How can I ingest them all in a single Spark application with minimal latency? Since new data is constantly arriving, this sounds like streaming. But in Spark streaming one needs to define the stream schema upfront, and I have different sources with different schemas that are not known upfront.

Spinning up a dedicated Spark application per datasource is not an option - there are 100+ datasources with very small files arriving, and running 100+ Spark applications is a waste of money. Everything should be ingested using a single cluster of moderate size.

The only idea I have now: in the driver process run a normal Kafka consumer; for each record read a dataframe, enrich it with additional fields and persist it to its delta table. For more parallelism, consume multiple messages and run them in futures, so multiple jobs run concurrently (a sketch of that is shown right after the pseudo-code below). Some pseudo-code, in the driver process:

import org.apache.spark.sql.functions.lit

val consumer = KafkaConsumer(...)
consumer.foreach { record =>
  val datasource = record.datasource                 // e.g. "A"
  val file = record.filename                         // e.g. "s3://bucket/A/file1"
  // read the newly landed avro file and enrich it with fields from the notification
  val df = spark.read.format("avro").load(file)
    .withColumn("id", lit(record.id))
  // append into this datasource's delta table
  val dest = s"s3://delta/$datasource"
  df.write.format("delta").mode("append").save(dest)
  consumer.commit(/* offset from record */)
}
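
The "futures" idea mentioned above could be sketched roughly as follows; the Notification case class, the parse helper, the topic name and the thread-pool size are illustrative assumptions, not part of the original setup:

import java.time.{Duration => JDuration}
import java.util.{Collections, Properties}
import java.util.concurrent.Executors
import scala.collection.JavaConverters._
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.spark.sql.functions.lit

// hypothetical shape of the parsed notification payload
case class Notification(datasource: String, filename: String, id: String)
def parse(json: String): Notification = ???          // JSON parsing omitted (jackson, circe, ...)

val kafkaProps = new Properties()                    // bootstrap.servers, group.id, deserializers, enable.auto.commit=false, ...
val consumer = new KafkaConsumer[String, String](kafkaProps)
consumer.subscribe(Collections.singletonList("file-notifications"))

// bounded pool: at most 8 ingest jobs run on the cluster at the same time
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

while (true) {
  val batch = consumer.poll(JDuration.ofSeconds(5)).asScala.toSeq

  val jobs = batch.map { rec =>
    Future {
      val n = parse(rec.value())
      spark.read.format("avro").load(n.filename)
        .withColumn("id", lit(n.id))                 // enrichment from the notification
        .write.format("delta").mode("append")
        .save(s"s3://delta/${n.datasource}")
    }
  }

  // wait for the whole poll batch, then commit offsets once,
  // so a failed write gets retried after a restart (at-least-once)
  Await.result(Future.sequence(jobs), Duration.Inf)
  consumer.commitSync()
}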

Sounds good (and a PoC shows it works), but I wonder if there are other options? Any other ideas are appreciated. Spark runs on the Databricks platform.

1 Answer


Spark does not constrain you to one Spark application per datasource ingestion: you can group datasources into a couple of Spark apps, or you can go with one Spark application for all the datasources, which is a feasible approach as long as that app has enough resources to ingest and process all of them.

You can do something like:

object StreamingJobs extends SparkApp {

  // consume from Kafka Topic 1
  StreamProcess_1.runStream(spark)

  // consume from Kafka Topic 2
  StreamProcess_2.runStream(spark)

  //  consume from Kafka Topic n
  StreamProcess_N.runStream(spark)

  // wait until termination
  spark.streams.awaitAnyTermination()

}
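
The StreamProcess_N objects above are left abstract. For this particular use case, one of them could be a Structured Streaming query over the notification topic that, inside foreachBatch, reads the listed avro files and appends them to that datasource's delta table. A minimal sketch for datasource "A" follows; the topic name, bootstrap servers, payload fields and checkpoint path are assumptions:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, lit}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical stream process for datasource "A"
object StreamProcess_A {

  private val notificationSchema = new StructType()
    .add("datasource", StringType)
    .add("filename", StringType)
    .add("id", StringType)

  def runStream(spark: SparkSession): Unit = {
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "file-notifications")
      .load()
      .select(from_json(col("value").cast("string"), notificationSchema).as("n"))
      .select("n.*")
      .filter(col("datasource") === "A")
      .writeStream
      .option("checkpointLocation", "s3://delta/_checkpoints/A")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // For every micro-batch of notifications, read the listed avro files
        // and append them, enriched with the id, to datasource A's delta table.
        batch.select("filename", "id").collect().foreach { row =>
          spark.read.format("avro").load(row.getString(0))
            .withColumn("id", lit(row.getString(1)))
            .write.format("delta").mode("append")
            .save("s3://delta/A")
        }
      }
      .start()
  }
}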

and maybe another Spark job for the batch processing:

object BatchedJobs extends SparkApp {

  // consume from data source 1
  BatchedProcess_1.run(spark)

  // consume from  data source 2
  BatchedProcess_2.run(spark)

  //  consume from  data source n
  BatchedProcess_N.run(spark) 

}
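
For completeness, one of these batch processes could be as simple as the following sketch (the object name and paths are illustrative; a full reload overwrites the target table):

import org.apache.spark.sql.SparkSession

// Hypothetical one-off reload of datasource A: read everything that has
// landed under the source prefix and rewrite the corresponding delta table.
object BatchedProcess_A {
  def run(spark: SparkSession): Unit = {
    spark.read.format("avro").load("s3://bucket/A/*")
      .write.format("delta").mode("overwrite")
      .save("s3://delta/A")
  }
}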
dumitru
  • I suppose the question was more about "how to load multiple tables with different schemas", not "how to make parallel jobs in Databricks". I'm searching for an answer too. – gunn Nov 04 '21 at 08:18