Opinion: Querying databases from Spark streaming or Structured streaming tasks

Question

We have a Spark streaming use case where we need to compute some metrics from ingested events (in Kafka), but the computations require additional metadata which are not present in the events.

The obvious design pattern I can think of is to make point queries to the metadata tables (on the master DB) from spark executor tasks and use that metadata info during the processing of each event.

Another idea would be to "enrich" the ingested events in a separate pipeline as a preprocessor step before sending them to Kafka. This could be done, say by another service or task.

The second scenario is more useful in cases when the domain/environment where Spark/hadoop runs is isolated from the domain of the master DB where all metadata is stored.

Is there a general consensus on how this type of event "enrichment" should be done? What other considerations am I missing here?

score 1 · Answer 1 · answered Oct 29 '19 at 12:58

Typically the first approach that you thought about is correct and meets your requirements.

There is know that within Apache Spark you can join data-in-motion with data-at-rest.

In other words you have your streaming context that continuously stream data from Kafka.

val dfStream = spark.read.kafka(...)

At the same time you can connect to the metastore DB (e.g spark.read.jdbc)

val dfMetaDb = spark.read.jdbc(...)

You can join them together

dsStream.join(dfMetaDB)

and continue the process from this point on. The benefits is that you don't touch other components and rely only on Spark processing capabilities.

Opinion: Querying databases from Spark streaming or Structured streaming tasks

1 Answers1