
Hello everyone.

I have a Structured Streaming pipeline on Delta Lake. The last (gold) table is supposed to count how many unique IDs access the platform per week.

I'm grouping the data by week in the stream, but I cannot count the unique values of IDs in the other column; I keep getting the count of all rows, duplicates included.

I have tried grouping the data twice, by week and then by device_id, and I have tried dropDuplicates(). Nothing has worked so far.

Can someone explain what I am missing?

My code:

from pyspark.sql.functions import weekofyear, col

def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
    .format("delta")
    .load(silverPath)
    .withColumn("week", weekofyear("client_event_time"))
    .groupBy(col("week"))
    .count()
    .select(col("week"),col("count").alias("WAU"))
    .writeStream 
    .format("delta")
    .option("checkpointLocation", goldPath + "/_checkpoint")
    .queryName(queryName)
    .outputMode("complete")
    .start(goldPath))

1 Answer


The following code worked. Exact distinct aggregations are not supported on streaming DataFrames, so I used approx_count_distinct with rsd=0.01 instead of count().

from pyspark.sql.functions import weekofyear, approx_count_distinct, col

def silverToGold(silverPath, goldPath, queryName):
  (spark.readStream
   .format("delta")
   .load(silverPath)
   # derive the calendar week from the event date
   .withColumn("week", weekofyear(col("eventDate")))
   .groupBy(col("week"))
   # approximate distinct count of device_ids per week (~1% relative error)
   .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU"))
   .select("week", "WAU")
   .writeStream
   .format("delta")
   .option("checkpointLocation", goldPath + "/_checkpoint")
   .queryName(queryName)
   .outputMode("complete")
   .start(goldPath))