
I am new to Spark.

I have a Spark Streaming batch job (maybe it should be Structured Streaming) which receives data from Kafka hourly.

And I found that my Spark job keeps consuming data and never stops.

So I want to control it. For example:

If it is now 3 am, my job should consume the data from 2~3 am from the Kafka topic; the next hour it should consume the 3~4 am data.

Any ideas on how to do this? Thanks.
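To make what I mean concrete, here is a rough sketch of the offset lookup I have in mind (untested; the class name, topic name, and bootstrap servers are my own placeholders). As far as I know, KafkaConsumer#offsetsForTimes (Kafka 0.10.1+) can map a timestamp to per-partition offsets:

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class HourlyOffsetLookup {
    // For each partition of the topic, find the earliest offset whose record
    // timestamp is >= timestampMillis (e.g. the start of the target hour).
    static Map<TopicPartition, Long> offsetsAt(String topic, long timestampMillis) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Map<TopicPartition, Long> query = new HashMap<>();
            consumer.partitionsFor(topic).forEach(p ->
                    query.put(new TopicPartition(p.topic(), p.partition()), timestampMillis));

            Map<TopicPartition, Long> offsets = new HashMap<>();
            consumer.offsetsForTimes(query).forEach((tp, oat) -> {
                // oat is null when no record at or after the timestamp exists.
                if (oat != null) {
                    offsets.put(tp, oat.offset());
                }
            });
            return offsets;
        }
    }
}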

--------Code---------

import com.opencsv.bean.CsvToBeanBuilder;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.*;

import java.io.Reader;
import java.io.StringReader;
import java.util.List;

SparkConf sparkConf = new SparkConf().setAppName("CalculateHourlyFromKafka");
// One micro-batch per hour.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.minutes(60));

// topicsSet and kafkaParams are defined elsewhere.
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.Subscribe(topicsSet, kafkaParams));

stream.foreachRDD((rdd, time) -> {
    // Lazily-instantiated singleton SparkSession (helper class not shown).
    SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());

    // Parse each Kafka message as a one-row CSV into a Span bean.
    JavaRDD<Span> rowRDD = rdd.map(message -> {
        Reader stringReader = new StringReader(message.value());
        List<Span> spanList = new CsvToBeanBuilder<Span>(stringReader)
                .withType(Span.class).build().parse();
        return spanList.get(0);
    });

    Dataset<Row> spanDataFrame = spark.createDataFrame(rowRDD, Span.class);
    spanDataFrame.createOrReplaceTempView("span_data_raw");

    Dataset<Row> aggregatedSpan = spark.sql(
            "select TAGS_APPNAME as applicationname " +
            "from span_data_raw " +
            "group by TAGS_APPNAME");
    aggregatedSpan.show();
});

jssc.start();
jssc.awaitTermination();
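If the lookup above works, I imagine replacing the open-ended stream with a bounded per-hour batch read via KafkaUtils.createRDD and explicit OffsetRanges, scheduled once an hour instead of running as a long-lived stream. Another untested sketch (readHour, the topic parameter, and the offsetsAt helper above are my own placeholder names):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HourlyBatchRead {
    // Reads exactly the records produced in [hourStart, hourEnd) as a bounded
    // batch RDD, instead of an open-ended stream. hourStart/hourEnd are epoch
    // millis; offsets come from the offsetsAt lookup sketched above.
    static JavaRDD<ConsumerRecord<String, String>> readHour(
            JavaSparkContext jsc, Map<String, Object> kafkaParams,
            String topic, long hourStart, long hourEnd) {

        Map<TopicPartition, Long> fromOffsets = HourlyOffsetLookup.offsetsAt(topic, hourStart);
        Map<TopicPartition, Long> untilOffsets = HourlyOffsetLookup.offsetsAt(topic, hourEnd);

        List<OffsetRange> ranges = new ArrayList<>();
        fromOffsets.forEach((tp, from) -> {
            Long until = untilOffsets.get(tp);
            // Skip partitions with no end offset (no records after hourEnd yet).
            if (until != null && until > from) {
                ranges.add(OffsetRange.create(tp.topic(), tp.partition(), from, until));
            }
        });

        return KafkaUtils.createRDD(
                jsc,
                kafkaParams,
                ranges.toArray(new OffsetRange[0]),
                LocationStrategies.PreferConsistent());
    }
}

The returned RDD could then be parsed and aggregated exactly as in the foreachRDD body above.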