I am new to Spark.
I have a Spark Streaming batch job (maybe it should be Structured Streaming) which receives data from Kafka hourly.
I found that it keeps consuming data and never stops.
I want to control it so that each run covers exactly one hour. For example, at 3 am it should consume the data from 2 to 3 am from the Kafka topic, and the next hour it should consume 3 to 4 am.
Any idea about it? Thanks.
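To make it concrete, what I imagine is a bounded batch read instead of a long-running stream. The snippet below is just a sketch of the idea, not something I have working: as far as I can tell it needs Spark 3.2+, where the Kafka source accepts startingTimestamp/endingTimestamp options, and the broker address and topic name here are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("CalculateHourlyFromKafka").getOrCreate();

// Floor "now" to the top of the hour: at 3 am this gives windowStart = 2 am, windowEnd = 3 am
long windowEnd = System.currentTimeMillis() / 3600000L * 3600000L;
long windowStart = windowEnd - 3600000L;

// Bounded batch read of exactly one hour of the topic; the job ends by itself
Dataset<Row> hourOfData = spark.read()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")          // placeholder
        .option("subscribe", "my_topic")                           // placeholder
        .option("startingTimestamp", String.valueOf(windowStart))
        .option("endingTimestamp", String.valueOf(windowEnd))
        .load();

hourOfData.selectExpr("CAST(value AS STRING)").show();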
--------My current code---------
SparkConf sparkConf = new SparkConf().setAppName("CalculateHourlyFromKafka");
// One micro-batch per hour
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.minutes(60));

// topicsSet and kafkaParams are defined elsewhere
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe(topicsSet, kafkaParams));

stream.foreachRDD((rdd, time) -> {
    SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());

    // Parse each record's CSV payload into a Span bean
    JavaRDD<Span> rowRDD = rdd.map(message -> {
        Reader stringReader = new StringReader(message.value());
        List<Span> spanList = new CsvToBeanBuilder<Span>(stringReader)
                .withType(Span.class)
                .build()
                .parse();
        return spanList.get(0);
    });

    Dataset<Row> spanDataFrame = spark.createDataFrame(rowRDD, Span.class);
    spanDataFrame.createOrReplaceTempView("span_data_raw");

    Dataset<Row> aggregatedSpan = spark.sql(
            "select TAGS_APPNAME as applicationname " +
            "from span_data_raw " +
            "group by TAGS_APPNAME");
    aggregatedSpan.show();
});

// Without these the streaming job never actually runs
jssc.start();
jssc.awaitTermination();
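Alternatively, staying on the spark-streaming-kafka-0-10 API, I was thinking of dropping the streaming context entirely and running a plain batch job once an hour (e.g. from cron) that translates the hour boundaries into Kafka offsets. Again only a sketch: it assumes the brokers are on Kafka 0.10.1+ (where offsetsForTimes was introduced), kafkaParams is the same map as above, and "my_topic" is a placeholder.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Same hour window as before: at 3 am this covers 2~3 am
long windowEnd = System.currentTimeMillis() / 3600000L * 3600000L;
long windowStart = windowEnd - 3600000L;

// Ask Kafka which offsets correspond to the window boundaries, per partition
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaParams);
List<TopicPartition> partitions = new ArrayList<>();
for (PartitionInfo p : consumer.partitionsFor("my_topic")) {       // placeholder topic
    partitions.add(new TopicPartition(p.topic(), p.partition()));
}
Map<TopicPartition, Long> startTimes = new HashMap<>();
Map<TopicPartition, Long> endTimes = new HashMap<>();
for (TopicPartition tp : partitions) {
    startTimes.put(tp, windowStart);
    endTimes.put(tp, windowEnd);
}
Map<TopicPartition, OffsetAndTimestamp> startOffsets = consumer.offsetsForTimes(startTimes);
Map<TopicPartition, OffsetAndTimestamp> endOffsets = consumer.offsetsForTimes(endTimes);
consumer.close();

// One OffsetRange per partition; skip partitions where Kafka cannot resolve both boundaries
List<OffsetRange> ranges = new ArrayList<>();
for (TopicPartition tp : partitions) {
    OffsetAndTimestamp from = startOffsets.get(tp);
    OffsetAndTimestamp until = endOffsets.get(tp);
    if (from != null && until != null) {
        ranges.add(OffsetRange.create(tp.topic(), tp.partition(), from.offset(), until.offset()));
    }
}

// A plain, bounded batch RDD covering exactly that hour of the topic
JavaRDD<ConsumerRecord<String, String>> rdd = KafkaUtils.createRDD(
        jsc, kafkaParams, ranges.toArray(new OffsetRange[0]),
        LocationStrategies.PreferConsistent());

From that rdd I could reuse the same CSV parsing and SQL as in the foreachRDD above, and the job would simply exit once the hour is processed. Does either direction make sense?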