
I use PySpark to readStream from Kafka, process the data, and writeStream to a Delta table.

pyspark 3.2.1
io.delta 1.2.2
hadoop 3.3.0

This code does not produce any results in the output Delta table when deployed on Kubernetes or run in Databricks.

Am I producing no data?

When I run display (without the writeStream part) in Databricks, I do see the data.

What's happening?

from pyspark.sql.functions import col, count, expr, from_json, window
from pyspark.sql.types import IntegerType, StringType, StructField, StructType


def run(spark, window_duration, watermark_delay):
    input_time_col = "timestamp"

    keep_original_cols = [input_time_col, "topic"]

    raw_message_data = StructType([
        StructField("col1", StringType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", StringType(), True),
        StructField("col4", IntegerType(), True),
        StructField("col5", IntegerType(), True),
    ])

    return (spark
            .readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", KAFKA_SERVERS)
            .option("subscribe", INPUT_TOPIC)
            .option("startingOffsets", STARTING_OFFSETS)
            .option("maxOffsetsPerTrigger", MAX_OFFSETS_PER_TRIGGER)
            .option("failOnDataLoss", FAIL_ON_DATA_LOSS)
            .option("minPartitions", MIN_PARTITIONS)
            .load()
            .withColumn("tmp", from_json(col("value").cast("string"), raw_message_data))
            .select(f"tmp.*", *keep_original_cols)
            .withWatermark(input_time_col, watermark_delay)
            .groupBy(
                window(col(input_time_col), window_duration).alias("period"),
            )
            .agg(
                count("*").alias("query_count")
            )
            .withColumn("period_start", expr("period.start"))
            .withColumn("date", expr("date(period_start)"))
            .withColumn("hour", expr("hour(period_start)"))
            .withColumn("minute", expr("minute(period_start)"))
            .writeStream
            .outputMode(OUTPUT_MODE)
            # .partitionBy("date", "hour")
            .format(OUTPUT_FORMAT)
            .option("mergeSchema", "true")
            .option("checkpointLocation", CHECKPOINT_LOCATION))


query = run(spark, "2 minutes", "1 minute")
query.start(OUTPUT_TABLE_PATH).awaitTermination()

I see the _delta_log directory being created, but no data files are appended.
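
A quick way to confirm the stream is running but simply hasn't closed a window yet is to inspect the query's progress. This is only a debugging sketch using the standard StreamingQuery API; `query` here is whatever `run(...).start(...)` returns (i.e. drop the awaitTermination() while debugging):

    # Debugging sketch: inspect what the running query is actually doing.
    query = run(spark, "2 minutes", "1 minute").start(OUTPUT_TABLE_PATH)

    # Both return plain dicts describing the query / last micro-batch.
    print(query.status)        # e.g. isDataAvailable / isTriggerActive
    print(query.lastProgress)  # numInputRows, eventTime.watermark, sink details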

EDIT: constants:

KAFKA_SERVERS = "...my kafka servers..."
INPUT_TOPIC = "some-topic"
MAX_OFFSETS_PER_TRIGGER = "1000"
STARTING_OFFSETS = "latest"
FAIL_ON_DATA_LOSS = "false"
MIN_PARTITIONS = "288"

WINDOW_DURATION = "2 minutes"
WATERMARK_DELAY = "30 seconds"

OUTPUT_FORMAT = "delta"
OUTPUT_MODE = "append"
CHECKPOINT_LOCATION = "wasbs://...someCheckpointLocation"
OUTPUT_TABLE_PATH = "wasbs://....blob.core.windows.net/output"

PARTITIONING_COLS = ["col1", "col2"]

EDIT2:

Running this part in Databricks works fine:

    df = (
        spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_SERVERS)
        .option("subscribe", INPUT_TOPIC)
        .option("startingOffsets", STARTING_OFFSETS)
        .option("maxOffsetsPerTrigger", MAX_OFFSETS_PER_TRIGGER)
        .option("failOnDataLoss", FAIL_ON_DATA_LOSS)
        .option("minPartitions", MIN_PARTITIONS)
        .load()
        .withColumn("tmp", from_json(col("value").cast("string"), raw_message_data))
        .select("tmp.*", *keep_original_cols)
        .withWatermark(input_time_col, watermark_delay)
        .groupBy(
            window(col(input_time_col), window_duration).alias("period"),
        )
        .agg(
            count("*").alias("query_count")
        )
        .withColumn("period_start", expr("period.start"))
        .withColumn("date", expr("date(period_start)"))
        .withColumn("hour", expr("hour(period_start)"))
        .withColumn("minute", expr("minute(period_start)"))
    )
    display(df)
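
For debugging outside Databricks, a console sink in complete mode shows the in-progress window counts immediately, whereas the Delta sink above uses append mode and only emits a window once the watermark passes its end. The following is a sketch for debugging only; the format and outputMode here are assumptions, not the original job's settings:

    # Debugging sketch: complete mode prints the current aggregation state
    # on every micro-batch instead of waiting for windows to close.
    (df.writeStream
       .format("console")
       .outputMode("complete")
       .option("truncate", "false")
       .start())
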
  • what is the value of `OUTPUT_MODE` and other constants not shown here? How long did you wait? – Alex Ott Aug 30 '22 at 13:14
  • @AlexOtt I have added constants. – Dariusz Krynicki Aug 30 '22 at 13:23
  • if you wait less than 3 minutes (window duration + watermark) then data won't be populated... – Alex Ott Aug 30 '22 at 13:28
  • @AlexOtt agreed, hence I waited long enough... please see EDIT2: running the code without writeStream shows me the data I want to write. – Dariusz Krynicki Aug 30 '22 at 13:33
  • @AlexOtt is the problem caused by me using tumbling window in groupBy window function and no trigger as well? – Dariusz Krynicki Aug 31 '22 at 12:56
  • Wow, I have found that when I use the "earliest" starting offset, it all works fine. Why does it not work with "latest" when the topic has 1 million messages per second flowing into it? It works fine with an equivalent implementation in Spark + Scala. – Dariusz Krynicki Aug 31 '22 at 13:50
  • Hmmm... never saw such behavior – Alex Ott Aug 31 '22 at 14:00
  • @AlexOtt I nailed it... I set maxOffsetsPerTrigger to 1000, but I have 10^6 msgs per second flowing into the topic, so I unintentionally throttled the throughput, which caused the processing window to never close. The data sat in state the whole time and was never appended to the output, because the window close was never triggered; I was processing the data too slowly. I changed the value from 1k to 50M and it works fine. Thanks anyway. – Dariusz Krynicki Aug 31 '22 at 14:19
  • interesting case... – Alex Ott Aug 31 '22 at 14:28
  • @DariuszKrynicki - can you please answer your own question, so other folks can easily see that this isn't an open question anymore? – Powers Sep 01 '22 at 01:26

1 Answer


I throttled the throughput by setting maxOffsetsPerTrigger to 1000 for testing purposes, since I had limited the number of executors to 3, while the input topic actually receives about 10^6 messages per second.

This made the stream consume incoming messages far more slowly than they arrive, so the watermark never advanced past the end of the window; the window never closed and, in append mode, no data was ever appended.

Changing maxOffsetsPerTrigger to 50,000,000 allows the stream to process messages at the incoming rate, and the whole pipeline works.
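
For reference, a rough sketch of the change under the same constants as above. The back-of-the-envelope reasoning: with maxOffsetsPerTrigger = 1000, each micro-batch consumes at most 1000 records while roughly 10^6 arrive per second, so event time in the consumed data advances far more slowly than wall-clock time and the watermark takes a very long time to cross the first window boundary. The exact new value is whatever lets the stream keep pace with the input rate:

    # Sketch only: the rest of the pipeline is unchanged.
    MAX_OFFSETS_PER_TRIGGER = "50000000"  # was "1000"; must keep pace with ~10^6 msg/s

    query = run(spark, WINDOW_DURATION, WATERMARK_DELAY).start(OUTPUT_TABLE_PATH)
    query.awaitTermination()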

Dariusz Krynicki