I have a source dataset with the following structure:

ID   Name        Rating
42   Book name   1
53   Other name  3
...  ...         ...

It is stored in HDFS as Parquet.

I need to calculate the average rating for each book. My code is:

import pyspark
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

user_rating_schema = StructType([
    StructField('ID', StringType()),
    StructField('Name', StringType()),
    StructField('Rating', IntegerType())
])

...

spark\
    .readStream\
    .schema(user_rating_schema)\
    .parquet(path=path_to_src)\
    .select(fun.col("Name"), fun.col("Rating"))\
    .groupBy("Name")\
    .agg(fun.mean("Rating").alias("MeanRating"))\
    .writeStream\
    .format("parquet")\
    .option("path", path_to_sink)\
    .option("checkpointLocation", "/tmp/checkpoint")\
    .start()

I receive the following error message:

AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;

What is a watermark? Where do I get one? Must it always be a timestamp? Can I derive a watermark from ID, or does it have to be generated?
