I have an input dataset with the following schema:
ID | Name | Rating |
---|---|---|
42 | "Book name 1" | "liked it" |
57 | "Book name 2" | "really liked it" |
This dataset is stored on HDFS in Parquet format. I need to count the rows for each book and write the result to another Parquet dataset.
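To make the intended result concrete, here is a minimal plain-Python sketch of the per-book row count, without Spark. The first two rows come from the table above; the third row and the helper name `count_rows_per_name` are hypothetical, added only for illustration.

```python
from collections import Counter

# Sample rows shaped like the input schema (ID, Name, Rating).
# The first two rows come from the table above; the third is a
# hypothetical extra row so that one book appears more than once.
rows = [
    (42, "Book name 1", "liked it"),
    (57, "Book name 2", "really liked it"),
    (99, "Book name 1", "it was ok"),  # hypothetical row
]


def count_rows_per_name(rows):
    """Return {Name: RowCount}, i.e. one count per distinct book name."""
    return dict(Counter(name for _id, name, _rating in rows))


row_counts = count_rows_per_name(rows)
print(row_counts)
```

This is the batch equivalent of what the streaming queries are meant to produce: one `RowCount` per distinct `Name`.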
When I do this aggregation with console output, everything seems to be OK:
rows_count_query =\
spark\
.readStream\
.schema(user_rating_schema)\
.parquet(path=src_data_path)\
.groupBy("Name")\
.agg(fun.count(fun.col("Name")).alias("RowCount"))\
.writeStream\
.format("console")\
.outputMode("complete")\
.start()
Output:
-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+--------+
| Name|RowCount|
+--------------------+--------+
|Annabel (Delirium...| 1|
| Plays and Masques| 1|
| Sein und Zeit| 1|
|A Blind Man Can S...| 6|
|Notes From a Defe...| 1|
| سگ کشی، فیلمنامه| 25|
| Goldfish| 2|
|The Gilded Chain ...| 1|
|Duct Tape and a T...| 1|
|The Ugliest House...| 1|
|On Stalin's Team:...| 1|
|Normative Theory ...| 1|
|پاداش آخر سال - م...| 3|
|14 Minutes: A Run...| 1|
|Emergency: This B...| 2|
| Whitney, My Love| 3|
|Archangel's Proph...| 1|
|The Dialogues of ...| 2|
|Black Cat (Gemini...| 2|
| Easter| 1|
+--------------------+--------+
only showing top 20 rows
I want to have this result written to Parquet. To do that, I do the following:
rows_count_query =\
spark\
.readStream\
.schema(user_rating_schema)\
.parquet(path=input_path)\
.select("Name", fun.current_timestamp().alias("CurrentTime"))\
.withWatermark("CurrentTime", delayThreshold="1 minute")\
.groupBy("Name", "CurrentTime")\
.agg(fun.count(fun.col("Name")).alias("RowCount"))\
.writeStream\
.format("parquet")\
.option("path", output_path)\
.option("checkpointLocation", "/tmp/checkpoint")\
.outputMode("append")\
.start()
Reading the output path as a Spark SQL dataset shows that it is empty:
df = spark.read.parquet(output_path)
df.show()
Output:
+----+-----------+--------+
|Name|CurrentTime|RowCount|
+----+-----------+--------+
+----+-----------+--------+
I added a timestamp column with the current time because a watermark needs one, and a watermark is required for the append output mode when doing aggregations. Why is nothing written to the output files? What should I do to have the result written to Parquet?