
I developed a Python Kafka producer that sends multiple JSON records as a newline-delimited JSON (NDJSON) binary string to a Kafka topic (a minimal sketch of the producer is shown below). Then I'm trying to read these messages in Spark Structured Streaming with PySpark as follows:

events_df = df.select(from_json(col("value").cast("string"), schema).alias("value"))

but this code works only with a single JSON document per message. If the value contains multiple records as newline-delimited JSON, Spark can't decode it correctly.

I don't want to send a separate Kafka message for each individual event. How can I achieve this?
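For reference, the producer side looks roughly like this (a minimal sketch, assuming `kafka-python`; the event fields are made up for illustration):

    from kafka import KafkaProducer
    import json

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Batch several events into one message as newline-delimited JSON
    # (the fields below are hypothetical placeholders)
    events = [{"id": 1, "type": "click"}, {"id": 2, "type": "view"}]
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    producer.send("quickstart-events", value=payload)
    producer.flush()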

  • Well, what is `schema` here? And why aren't you able to send individual objects as messages? That schema is going to be the issue if it is just representing a single JSON object... and `from_json` doesn't work on NDJSON either. Otherwise, if you can `map` a split function over the records so that they become individual records, then that's what you should do here – OneCricketeer Feb 01 '21 at 20:22
  • I've generated the schema automatically by importing a single event JSON from disk. How can I split the value of a Kafka message and then correctly parse the JSON with the proper schema? Reading from disk actually supports NDJSON and also schema inference – user3376554 Feb 02 '21 at 11:36
  • Like I said, then `schema` is a Struct, not NDJSON, which AFAIK has no valid schema type. The fix is to split the records on newlines via a flatmap, which will then represent them as individual dataframe rows. You still didn't clarify why you don't want to send individual messages (keeping in mind that Kafka is not meant for "file transfer", so you shouldn't compare reading from disk to consuming from Kafka) – OneCricketeer Feb 02 '21 at 16:29
  • In a real-time streaming flow, with a huge quantity of data coming in continuously, it is unthinkable to split the data and send single events one by one. – user3376554 Feb 03 '21 at 08:58
  • Is it? Your answer does exactly that. If you actually want multiple objects in a message, use a proper array – OneCricketeer Feb 03 '21 at 13:40

1 Answer


I managed to do what I was looking for in this way: splitting the full text string on newlines and then exploding the resulting array into rows, which can each be parsed with the schema:

    from pyspark.sql.functions import col, explode, from_json, split

    # event_schema is the StructType of a single event, e.g. inferred from a
    # sample file on disk: event_schema = spark.read.json("event.json").schema

    events = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "quickstart-events") \
        .option("startingOffsets", "earliest") \
        .load() \
        .selectExpr("CAST(value AS STRING) as data")

    # Split the NDJSON payload on newlines and explode into one row per document
    events = events.select(explode(split(events.data, '\n')))
    # Parse each JSON document with the schema, then flatten the struct into columns
    events = events.select(from_json(col("col"), event_schema).alias('value'))
    events = events.selectExpr('value.*')
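As suggested in the comments, if the producer can instead send the events as a proper JSON array, the splitting step goes away: `from_json` accepts an `ArrayType` schema, so the whole value can be parsed and exploded directly (a sketch under that assumption, reusing `event_schema` from above):

    from pyspark.sql.functions import col, explode, from_json
    from pyspark.sql.types import ArrayType

    # Parse the entire message as a JSON array of events, then explode into rows
    events = events.select(from_json(col("data"), ArrayType(event_schema)).alias("arr"))
    events = events.select(explode(col("arr")).alias("value")).selectExpr("value.*")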