
After installing PySpark, testing that it works, and adding the right connector for the Kafka integration, I now try to load data from Kafka running on another machine on the same network. When I start the job, it gets stuck at [*]: no error, nothing. I don't understand the issue, so any help would be appreciated. Here is my code:

import os

# Pull in the Kafka connector when the JVM is launched
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 pyspark-shell'

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

kafka_topic_name = "test-spark"
kafka_bootstrap_servers = '192.168.1.3:9092'

spark = SparkSession \
    .builder \
    .appName("PySpark Structured Streaming with Kafka and Message Format as JSON") \
    .master("local[*]") \
    .getOrCreate()

# Construct a streaming DataFrame that reads from TEST-SPARK
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .load()

print("Printing Schema of df: ")
df.printSchema()


df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
df1.printSchema()

# Schema of the JSON messages in the topic
schema = StructType() \
    .add("name", StringType()) \
    .add("type", StringType())

df2 = df1 \
    .select(from_json(col("value"), schema).alias("records"), "timestamp")
df3 = df2.select("records.*", "timestamp")

print("Printing Schema of records_df3: ")
df3.printSchema()

records_write_stream = df3 \
    .writeStream \
    .trigger(processingTime='5 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start()
records_write_stream.awaitTermination()

print("Stream Data Processing Application Completed.")


- The command for the Kafka console producer that I tried:
$ ./bin/kafka-console-producer --broker-list localhost:9092 --topic test_spark \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'

>{"f1": "value1"}
>{"f1": "value2"}
>{"f1": "value3"}
  • When I load the Kafka source it shows no error and continues, but when I try to start the job it gets stuck.
  • PS: the weird thing is that I tried to run this code while the machine hosting Kafka was down, and it still loaded the Kafka source, i.e., this code ran without error:
# Construct a streaming DataFrame that reads from TEST-SPARK
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", kafka_topic_name) \
    .load()
  • It continues until it gets stuck again at the last piece of code shown above, which is weird.
  • So please, any suggestions? (A quick reachability check I could run is sketched right after this list.)
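For completeness, a plain-Python sketch of how I could check from the Spark machine whether the broker port is reachable at all (just a TCP connect, nothing Kafka-specific):

import socket

# Hypothetical reachability check against the broker address used above
try:
    with socket.create_connection(("192.168.1.3", 9092), timeout=5):
        print("Broker port is reachable")
except OSError as exc:
    print("Cannot reach broker:", exc)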
