I submitted a Spark Streaming job that calculates site page UV (unique visitors) with the mapWithState function. From the very beginning, the processing time of completed batches has been very unstable.
The slow batches (>45s) spend most of their time in the first stage of the job. But that first stage only receives messages from the Kafka brokers (createDirectStream) and applies a simple map, which I reduced to constant values for testing:
stream
  .map { x =>
    // Test payload: constant key and value; the Kafka record x is ignored.
    val raw = "0.0;" + getFormatDate(1600000000L) + ";1;1;1;1;1;1;1;1;1;1;1;1"
    Tuple2[String, Tuple3[Int, HLL, String]](
      getMd5(raw),
      Tuple3(1, hyperLogLog(getMd5("1").getBytes(Charsets.UTF_8)), raw))
  }
I don't know why some batches take 40+ seconds in the first stage. Is receiving from Kafka unstable? Kafka and Spark are on the same LAN.