I have a source dataset with the following structure:
ID | Name | Rating |
---|---|---|
42 | Book name | 1 |
53 | Other name | 3 |
... | ... | ... |
It is stored in HDFS as Parquet.
I need to calculate the average rating for each book. My code for this is:
import pyspark
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

user_rating_schema = StructType([
    StructField('ID', StringType()),
    StructField('Name', StringType()),
    StructField('Rating', IntegerType())
])
...
spark\
    .readStream\
    .schema(user_rating_schema)\
    .parquet(path=path_to_src)\
    .select(fun.col("Name"), fun.col("Rating"))\
    .groupBy("Name")\
    .agg(fun.mean("Rating").alias("MeanRating"))\
    .writeStream\
    .format("parquet")\
    .option("path", path_to_sink)\
    .option("checkpointLocation", "/tmp/checkpoint")\
    .start()
I receive the following error message:
AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
What is a watermark? Where do I get it? Must it always be a timestamp? Can I get a watermark from ID, or should it be generated?
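To make the result I am after concrete, here is a minimal plain-Python sketch of the same aggregation, run on the two sample rows from the table above. The in-memory `rows` list and the helper `mean_rating_by_name` are only illustrative assumptions; the real data is read from Parquet by Spark, and this sketch does not involve streaming or watermarks at all.

```python
from collections import defaultdict

# Sample rows copied from the table above: (ID, Name, Rating).
# Illustrative only; the real dataset lives in HDFS as Parquet.
rows = [
    ("42", "Book name", 1),
    ("53", "Other name", 3),
]


def mean_rating_by_name(rows):
    """Return {Name: mean Rating}, the batch equivalent of
    groupBy("Name").agg(mean("Rating"))."""
    totals = defaultdict(lambda: [0, 0])  # Name -> [sum of ratings, count]
    for _id, name, rating in rows:
        totals[name][0] += rating
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


mean_ratings = mean_rating_by_name(rows)
print(mean_ratings)
```

Each book name should map to a single mean rating, which is what I expect to find in the `MeanRating` column of the sink.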