I have the following schema in my DataFrame:
root
|-- device_id: string (nullable = true)
|-- eventName: string (nullable = true)
|-- client_event_time: timestamp (nullable = true)
|-- eventDate: date (nullable = true)
|-- deviceType: string (nullable = true)
I want to add the following two columns to this DataFrame:
WAU: a count of weekly active users (distinct device IDs, grouped by week)
week: the week of the year (presumably via the weekofyear SQL function)
For the distinct count I want to use approx_count_distinct, with its optional rsd argument set to 0.01.
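On a static DataFrame, the aggregation I am after would look something like the sketch below (df is just a placeholder for a DataFrame with the schema above):

from pyspark.sql.functions import approx_count_distinct, weekofyear

# Batch sketch: week of year plus an approximate distinct count of
# devices per week, with a relative standard deviation of 1%.
weekly = (df
    .withColumn("week", weekofyear("eventDate"))
    .groupBy("week")
    .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU")))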
For the streaming version, I tried to start writing something like the below, but I am getting an error:
from pyspark.sql.functions import weekofyear

(spark.readStream
    .format("delta")
    .load(inputpath)
    .groupBy(weekofyear("eventDate"))
    .count()
    .distinct()
    .writeStream
    .format("delta")
    .option("checkpointLocation", outputpath)
    .outputMode("complete")
    .start(outputpath))
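My best guess at a corrected streaming query is the sketch below: derive week with weekofyear, then replace the count().distinct() chain with a single approx_count_distinct aggregation (inputpath and outputpath as above). Is this the right way to express it in Structured Streaming?

from pyspark.sql.functions import approx_count_distinct, weekofyear

(spark.readStream
    .format("delta")
    .load(inputpath)
    # Derive the week-of-year column from the event date.
    .withColumn("week", weekofyear("eventDate"))
    # One approximate distinct device count per week (rsd = 1%).
    .groupBy("week")
    .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", outputpath)
    .outputMode("complete")
    .start(outputpath))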