
I have an input dataset with the following schema:

ID  Name           Rating
42  "Book name 1"  "liked it"
57  "Book name 2"  "really liked it"

This dataset is stored on HDFS in parquet. I need to calculate the row count for each book and write the result to another parquet file.
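For comparison, the batch (non-streaming) version of this aggregation is straightforward; a minimal sketch, assuming `user_rating_schema` is defined and `pyspark.sql.functions` is imported as `fun`:

# Batch (non-streaming) sketch of the row count per book
df = spark.read.schema(user_rating_schema).parquet(src_data_path)

row_counts =\
    df\
        .groupBy("Name")\
        .agg(fun.count(fun.col("Name")).alias("RowCount"))

row_counts.write.mode("overwrite").parquet(output_path)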

When I do this aggregation with console output, everything seems to be OK:

rows_count_query =\
    spark\
        .readStream\
        .schema(user_rating_schema)\
        .parquet(path=src_data_path)\
        .groupBy("Name")\
        .agg(fun.count(fun.col("Name")).alias("RowCount"))\
        .writeStream\
        .format("console")\
        .outputMode("complete")\
        .start()

Output:

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+--------+
|                Name|RowCount|
+--------------------+--------+
|Annabel (Delirium...|       1|
|   Plays and Masques|       1|
|       Sein und Zeit|       1|
|A Blind Man Can S...|       6|
|Notes From a Defe...|       1|
|   سگ کشی، فیلم‌نامه|      25|
|            Goldfish|       2|
|The Gilded Chain ...|       1|
|Duct Tape and a T...|       1|
|The Ugliest House...|       1|
|On Stalin's Team:...|       1|
|Normative Theory ...|       1|
|پاداش آخر سال - م...|       3|
|14 Minutes: A Run...|       1|
|Emergency: This B...|       2|
|    Whitney, My Love|       3|
|Archangel's Proph...|       1|
|The Dialogues of ...|       2|
|Black Cat (Gemini...|       2|
|              Easter|       1|
+--------------------+--------+
only showing top 20 rows

I want to have this result written to parquet.

To do that, I do the following:

rows_count_query =\
    spark\
        .readStream\
        .schema(user_rating_schema)\
        .parquet(path=input_path)\
        .select("Name", fun.current_timestamp().alias("CurrentTime"))\
        .withWatermark("CurrentTime", delayThreshold="1 minute")\
        .groupBy("Name", "CurrentTime")\
        .agg(fun.count(fun.col("Name")).alias("RowCount"))\
        .writeStream\
        .format("parquet")\
        .option("path", output_path)\
        .option("checkpointLocation", "/tmp/checkpoint")\
        .outputMode("append")\
        .start()

Reading the output path back as a Spark DataFrame shows that it is empty:

df = spark.read.parquet(output_path)
df.show()

Output:

+----+-----------+--------+
|Name|CurrentTime|RowCount|
+----+-----------+--------+
+----+-----------+--------+

I added a timestamp column with the current time because a watermark is required for the append output mode when doing aggregations. Why is nothing written to the output path? What should I do to get the result written to parquet?
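For context, a watermark is normally defined on an event-time column carried by the data itself, usually combined with a time window, so that append mode can emit each group once its window closes. A minimal sketch of that pattern (assuming a hypothetical `EventTime` timestamp column, which this dataset does not actually have):

# Sketch: watermark over a real event-time column plus a window,
# so append mode can emit each group once its window is finalized.
# "EventTime" is hypothetical; the actual schema has no timestamp column.
windowed_counts =\
    spark\
        .readStream\
        .schema(user_rating_schema)\
        .parquet(path=src_data_path)\
        .withWatermark("EventTime", delayThreshold="1 minute")\
        .groupBy(fun.window("EventTime", "5 minutes"), "Name")\
        .agg(fun.count(fun.col("Name")).alias("RowCount"))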

1 Answer


Add a `.trigger(...)` call to the query before `.start()`, and adjust the trigger interval (`continuous="1 second"`) based on your requirements:

rows_count_query =\
    spark\
        .readStream\
        .schema(user_rating_schema)\
        .parquet(path=input_path)\
        .select("Name", fun.current_timestamp().alias("CurrentTime"))\
        .withWatermark("CurrentTime", delayThreshold="1 minute")\
        .groupBy("Name", "CurrentTime")\
        .agg(fun.count(fun.col("Name")).alias("RowCount"))\
        .writeStream\
        .format("parquet")\
        .option("path", output_path)\
        .option("checkpointLocation", "/tmp/checkpoint")\
        .outputMode("append")\
        .trigger(continuous="1 second")\
        .start()

By making these modifications, your streaming query should continuously process the incoming data, calculate the row counts, and write the results to the specified parquet output path.
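If the continuous trigger turns out not to be supported for your sink (see the comment below), a micro-batch processing-time trigger is the variant the parquet file sink does support. A minimal sketch, where `aggregated_df` is a placeholder for the aggregated streaming DataFrame built above:

# Sketch: micro-batch trigger instead of continuous processing.
# Continuous mode supports only Kafka/console/memory sinks; the
# parquet file sink runs in micro-batch mode.
rows_count_query =\
    aggregated_df\
        .writeStream\
        .format("parquet")\
        .option("path", output_path)\
        .option("checkpointLocation", "/tmp/checkpoint")\
        .outputMode("append")\
        .trigger(processingTime="1 second")\
        .start()

rows_count_query.awaitTermination()  # block while the stream runs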

  • According to [this page](https://stackoverflow.com/questions/50952042/continuous-trigger-not-found-in-structured-streaming), continuous triggers are not supported for the `parquet` data source. When I tried adding `.trigger(continuous="1 second")`, I received an error message: `Py4JJavaError: An error occurred while calling o246.start. : java.lang.IllegalStateException: Unknown type of trigger: ContinuousTrigger(1000)...` – Георгий Гуминов Jun 20 '23 at 08:05