Hi everyone.
I have a Structured Streaming pipeline on Delta Lake. The last (gold) table is supposed to count how many unique IDs access the platform per week.
I'm grouping the data by week in the streaming query, but I can't count the unique values in the ID column; instead I keep getting a count of all rows, repeated IDs included.
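To make the goal concrete, the gold table should end up looking roughly like this (made-up numbers, just to show the shape I'm after):

week | WAU
-----|----
14   | 350
15   | 420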
I have tried grouping the data twice, first by week and then by device_id, and I have tried dropDuplicates(). Nothing has worked so far (rough sketches of both attempts are at the end of this post).
Can someone explain what I am missing?
My code:
from pyspark.sql.functions import weekofyear, col

def silverToGold(silverPath, goldPath, queryName):
    # Read the silver Delta table as a stream, tag each event with its ISO week,
    # count rows per week, and continuously write the result to the gold table.
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear("client_event_time"))
        .groupBy(col("week"))
        .count()
        .select(col("week"), col("count").alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))
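For reference, the two attempts I mentioned above looked roughly like this (sketched from memory, so the exact calls may differ slightly; device_id is the ID column in my silver table):

from pyspark.sql.functions import weekofyear, col

events = (spark.readStream
    .format("delta")
    .load(silverPath)
    .withColumn("week", weekofyear("client_event_time")))

# Attempt 1: group by week and device_id, hoping the second key would dedupe the IDs
attempt1 = events.groupBy(col("week"), col("device_id")).count()

# Attempt 2: drop duplicate (week, device_id) pairs before the weekly count
attempt2 = (events
    .dropDuplicates(["week", "device_id"])
    .groupBy(col("week"))
    .count()
    .select(col("week"), col("count").alias("WAU")))

Neither variant got me one distinct count per week.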