
I'm new to Spark Streaming and have a general question about its usage. I'm currently implementing an application that streams data from a Kafka topic.

Is it a common scenario to run the application as a one-off batch, for example at the end of the day: collect all the data from the topic, do some aggregations and transformations, and so on?

That would mean that after starting the app with spark-submit, all of this work is performed in one batch and then the application shuts down. Or is Spark Streaming built to run endlessly, permanently processing the data stream in continuous batches?

Ruslan Ostafiichuk
Vik
  • Spark Streaming is for the latter: processing an infinite, never-ending stream of data. It processes that stream in configurably sized batches, though. If it makes sense to dump all that data from Kafka into your Spark cluster once a day, then you could just run a daily Spark job without Spark Streaming. – medloh Nov 28 '18 at 20:40
  • Take a look at Structured Streaming and [`Trigger.Once`](https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/java/org/apache/spark/sql/streaming/Trigger.java#L90-L98). It is intended exactly for such a processing mode. – 10465355 Nov 28 '18 at 22:44
  • Thank you all for the input. I guess I will use KafkaUtils.createRDD to get a dataset within an offset range (https://spark.apache.org/docs/2.3.0/streaming-kafka-0-10-integration.html). – Vik Nov 30 '18 at 10:50

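To make the `Trigger.Once` suggestion from the comments concrete, here is a minimal Structured Streaming sketch; the broker address, topic name, checkpoint path, and console sink are placeholders. Launched once a day with spark-submit, it processes everything currently available in the topic in one batch and then exits.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object EndOfDayKafkaJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("end-of-day-kafka").getOrCreate()
    import spark.implicits._

    // Read the Kafka topic as a stream; on the very first run start from the
    // earliest offsets, afterwards the checkpoint remembers what was processed.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "my-topic")                  // placeholder
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Example aggregation: number of messages per key.
    val counts = events.groupBy($"key").count()

    // Trigger.Once processes everything available at start-up in a single
    // micro-batch and then stops the query, so the application exits.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")                                            // sketch sink
      .option("checkpointLocation", "/tmp/checkpoints/end-of-day")  // placeholder
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination()
    spark.stop()
  }
}
```

Because the committed offsets are stored in the checkpoint directory, the next run only picks up data that arrived since the previous one.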
1 Answer


You can use the Kafka Streams API and set a window time to perform the aggregation and transformation over the events in your topic, one windowed batch at a time. For more information about windowing, see https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html#windowing
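A minimal sketch of such a windowed aggregation, assuming the kafka-streams-scala module is on the classpath (the application id, broker address, topic name, and the 24-hour window size are placeholders; exact DSL signatures vary slightly across Kafka versions):

```scala
import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-counts") // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")  // placeholder

    val builder = new StreamsBuilder()

    // Count events per key in non-overlapping 24-hour windows.
    builder.stream[String, String]("my-topic")
      .groupByKey
      .windowedBy(TimeWindows.of(Duration.ofHours(24)))
      .count()
      .toStream
      .foreach((windowedKey, count) => println(s"$windowedKey -> $count"))

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```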

Mehdi Bahra