I have created a topic in Kafka called "test", which has just one partition and is not replicated.
I have created a Kafka producer that writes the string "A B C A" to the topic "test" in a loop of 100000 iterations, sleeping 1000 ms between iterations (Thread.sleep). The key of each record is the iteration index.
I have run the following code on both CentOS 7 and Windows. I usually build a fat jar using the Maven assembly plugin and run it with spark-submit. I always have to specify the packages when submitting the jar: --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
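import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class StreamFromKafka {
    public static void es() throws StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("streamFromKafka").master("local[*]").getOrCreate();
        String columnName = "value";
        // Read the "test" topic as a streaming DataFrame.
        Dataset<Row> df = session.readStream().format("kafka")
                // Not prefixed with "kafka.", so this option is probably not forwarded to the consumer.
                .option("group.id", "test-consumer-group")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test").load();
        // Kafka keys and values arrive as binary; cast them to strings and keep only the value.
        Dataset<Row> df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").select(columnName);
        // Split each message into words, one row per word.
        Dataset<String> words = df1.as(Encoders.STRING()).flatMap(line -> Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());
        //comment1 --> StreamingQuery query0 = words.writeStream().outputMode("update").format("console").start();
        //comment2 --> query0.awaitTermination();
        // Running count per word, printed to the console on every mini batch.
        Dataset<Row> wordCount = words.groupBy("value").count();
        StreamingQuery query = wordCount.writeStream().outputMode("update").format("console").start();
        query.awaitTermination();
    }
}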
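For reference, the sketch below mimics in plain Java (no Spark or Kafka involved) what the split and groupBy("value").count() steps should compute for the message "A B C A", and how the totals keep growing as more messages arrive. The class and method names (WordCountSketch, countWords, runningTotals) are hypothetical and exist only for this illustration; it says nothing about how Spark executes the aggregation or how long it takes.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustration only: hypothetical names, no Spark or Kafka involved.
class WordCountSketch {

    // Splits one message on single spaces (like line.split(" ") in the flatMap)
    // and counts the occurrences of each word (like groupBy("value").count()).
    static Map<String, Long> countWords(String line) {
        return Arrays.stream(line.split(" "))
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Adds up the per-message counts for the same message received several times,
    // the way a streaming aggregation keeps running totals across mini batches.
    static Map<String, Long> runningTotals(String line, int messages) {
        Map<String, Long> totals = new TreeMap<>();
        for (int i = 0; i < messages; i++) {
            countWords(line).forEach((word, count) -> totals.merge(word, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(countWords("A B C A"));       // prints {A=2, B=1, C=1}
        System.out.println(runningTotals("A B C A", 3)); // prints {A=6, B=3, C=3}
    }
}
```

So after n messages the console table of the aggregated query should show A with 2n, and B and C with n each.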
If I uncomment "comment1" and "comment2" in the code above, the table is printed quickly on Windows:
-------------------------------------------
Batch: 5
-------------------------------------------
+-----+
|value|
+-----+
|    A|
|    B|
|    C|
|    A|
|    A|
|    B|
|    C|
|    A|
+-----+
However, with comment1 and comment2 commented out again, so that only the groupBy query runs, the mini batches take a long time on Windows.
So I conclude that the stream DOES read from Kafka on Windows, but the groupBy aggregation takes a long time.
I left this implementation running longer on Windows than on Linux, starting yesterday evening at 20:46. The mini batches (under the hood, the Structured Streaming API implements real-time streaming with mini batches) last much longer on Windows. For example, as you can see in the following picture, it takes one minute to execute two batches:
As you can see in the following picture, it takes three minutes to execute four batches:
It is quicker on Linux. Since I tried it on Linux first, I expected the console output to appear at least as quickly on Windows; when I didn't see anything, I thought it wasn't working.
I should time the mini batches on Linux as well in order to compare the two behaviours.