
I have created a Kafka topic called "test" with a single partition and no replication.

I have created a Kafka producer that writes the string "A B C A" to the topic "test" in a loop of 100000 iterations, sleeping 1000 ms between iterations (Thread.sleep). The key of each message is the iteration index.

I have run the following code on both CentOS 7 and Windows. I usually build a fat jar with the Maven Assembly Plugin and run it with spark-submit. I always have to specify the package when submitting the jar: --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0

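import java.util.Arrays;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class StreamFromKafka {

    public static void es() throws StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("streamFromKafka").master("local[*]").getOrCreate();

        String columnName = "value";

        // Read the "test" topic as a streaming DataFrame.
        // Note: "group.id" lacks the "kafka." prefix, so Spark's Kafka source does not forward it to the consumer.
        Dataset<Row> df = session.readStream().format("kafka")
                .option("group.id","test-consumer-group")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test").load();

        // Cast the binary key and value to strings, then keep only the value column.
        Dataset<Row> df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").select(columnName);

        // Split each message into words; the explicit cast keeps the lambda overload unambiguous.
        Dataset<String> words = df1.as(Encoders.STRING()).flatMap(
                (FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());

        //comment1 --> StreamingQuery query0 = words.writeStream().outputMode("update").format("console").start();

        //comment2 --> query0.awaitTermination();

        // Running count per word, printed to the console in update mode.
        Dataset<Row> wordCount = words.groupBy("value").count();

        StreamingQuery query = wordCount.writeStream().outputMode("update").format("console").start();

        query.awaitTermination();

    }

}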

If I uncomment "comment1" and "comment2" in the above code, the table is printed quickly on Windows:

-------------------------------------------
Batch: 5
-------------------------------------------
+-----+
|value|
+-----+
|    A|
|    B|
|    C|
|    A|
|    A|
|    B|
|    C|
|    A|
+-----+

However, if I leave comment1 and comment2 commented out, so that only the word-count query runs, the mini-batches seem to take a long time on Windows.

So I can conclude that the stream DOES read from Kafka on Windows, but the group by takes a lot of time.
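For reference, the aggregation itself is small for a single message. Below is a minimal plain-Java sketch (no Spark involved) of what the split followed by the group-by count computes for one "A B C A" message. The class and method names are hypothetical and exist only for this illustration; it says nothing about how Spark executes the query internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Spark job above.
class WordCountSketch {

    // Splits one message on spaces and counts each word, mirroring
    // flatMap(split) followed by groupBy("value").count() for a single record.
    static Map<String, Long> countWords(String line) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String word : line.split(" ")) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("A B C A")); // prints {A=2, B=1, C=1}
    }
}
```

In update mode the streaming query would keep adding to these counts as further messages arrive, so each later batch shows the running totals for the words it touched.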

I left this implementation running for longer on Windows than on Linux (since yesterday evening at 20:46). The mini-batches are much longer on Windows (under the hood, the Structured Streaming API implements real-time streaming as mini-batches). For example, as you can see in the following picture, it takes one minute to execute two batches:

enter image description here

As you can see in the following picture, it takes three minutes to execute four batches:

enter image description here

It is quicker on Linux. Since I tried it on Linux first, I expected the console output to appear sooner on Windows; when I didn't see anything, I thought it wasn't working.

I should time the mini-batches on Linux in order to compare the two behaviours.
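As a starting point for that comparison, the rough Windows figures quoted above can be turned into an average time per batch. This is only a sketch under the assumption that the elapsed times read off the screenshots (one minute for two batches, three minutes for four) are accurate; the class and method names are hypothetical. Spark's own per-query progress information would be a more precise source than wall-clock estimates like these.

```java
import java.time.Duration;

// Hypothetical helper for comparing batch timings between machines.
class BatchTimingSketch {

    // Average wall-clock seconds per mini-batch over an observed window.
    static double secondsPerBatch(Duration elapsed, int batches) {
        if (batches <= 0) {
            throw new IllegalArgumentException("batches must be positive");
        }
        return elapsed.toMillis() / 1000.0 / batches;
    }

    public static void main(String[] args) {
        // Figures taken from the two Windows observations described above.
        System.out.println(secondsPerBatch(Duration.ofMinutes(1), 2)); // prints 30.0
        System.out.println(secondsPerBatch(Duration.ofMinutes(3), 4)); // prints 45.0
    }
}
```

Running the same calculation on figures collected on Linux would give a like-for-like number to compare.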

Peter
  • You ran the exact same code on CentOS and Windows without changing the group id? If so, and you're not producing data, then the Windows consumer would be waiting for new events... Java is cross-platform, so blaming the OS is probably not the issue when you're not dealing with low-level details – OneCricketeer Dec 05 '18 at 02:19
  • I edited in order to try to be more clear: "I have created a Kafka producer that writes on the topic "test" the following string: "A B C A" in a cycle of 100000 iterations. There is a 1000 ms of sleep between the iterations (Thread.sleep). The key is the index of each cycle's iteration". As I wrote, the code is the same. Anyway, I discovered that it works on Windows. It just has longer mini-batches. – Peter Dec 05 '18 at 08:23
  • @cricket_007, see my "EDIT" section – Peter Dec 05 '18 at 08:49
  • I feel you should put that as an answer below rather than edit the question – OneCricketeer Dec 05 '18 at 14:03
  • @cricket_007 The point is that I can't explain why the mini-batches are longer... and I'd like not only a confirmation but also an explanation from an expert user. – Peter Dec 05 '18 at 15:11
  • Moreover, I should time the implementations. – Peter Dec 05 '18 at 15:17
  • I wouldn't say I'm an expert user, but I can run code on windows – OneCricketeer Dec 05 '18 at 15:37
  • So at the very least you need to change the title of the question! – thebluephantom Dec 05 '18 at 20:50
  • @thebluephantom, that seems like good advice. I have just edited my question. – Peter Dec 06 '18 at 13:35

0 Answers