
I have the following schema in my dataframe:

root
 |-- device_id: string (nullable = true)
 |-- eventName: string (nullable = true)
 |-- client_event_time: timestamp (nullable = true)
 |-- eventDate: date (nullable = true)
 |-- deviceType: string (nullable = true)

I want to add the following two columns to this dataframe:

WAU: count of weekly active users (distinct device IDs grouped by week)

week: week of year (using the appropriate SQL function)

I want to use approx_count_distinct, and the optional rsd keyword also needs to be set to 0.01.

I started writing something like the code below, but I am getting errors.

(spark.readStream
  .format("delta")
  .load(inputpath)
  .groupBy(weekofyear('eventDate'))
  .count()
  .distinct()
  .writeStream
  .format("delta")
  .option("checkpointLocation", outputpath)
  .outputMode("complete")
  .start(outputpath))
  • What error did you get? – mck Nov 13 '20 at 12:31
  • AnalysisException: dropDuplicates is not supported after aggregation on a streaming DataFrame/Dataset; However, I want to take a distinct count of device_ids. @mck – sbs Nov 13 '20 at 12:48
  • Another error is org.apache.spark.sql.AnalysisException: Attribute name "weekofyear(eventDate)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; @mck – sbs Nov 13 '20 at 12:50
  • the second error is obviously easy to solve, no? Just use an alias as suggested by the error message – UninformedUser Nov 13 '20 at 14:28
  • Yes, that has been fixed by using groupBy(weekofyear('eventDate').alias('week')). However, I want to get the count of distinct device_id grouped by week, and it should be named WAU. – sbs Nov 13 '20 at 14:55
  • can't you just call `countDistinct` in that case? – UninformedUser Nov 13 '20 at 15:00
  • I tried the code below, but I am getting AttributeError: 'GroupedData' object has no attribute 'withColumn': spark.readStream .format("delta") .load(inputpath) .groupBy(weekofyear('eventDate').alias('week')) .countDistinct('device_id').alias('WAU') – sbs Nov 13 '20 at 15:02
  • I don't understand what you want to say. You have to call `.agg(countDistinct('device_id'))` instead of `.count().distinct()`, which is totally different: that would first count per group, which makes any `distinct()` afterwards meaningless. – UninformedUser Nov 13 '20 at 15:03
  • I had already tried that with .agg(F.countDistinct('device_id')).alias('WAU') and was getting AnalysisException: Distinct aggregations are not supported on streaming DataFrames/Datasets. Consider using approx_count_distinct() instead. That's why I mentioned that approx_count_distinct() should be used. – sbs Nov 13 '20 at 15:09
  • I have managed to use approx_count_distinct with spark.readStream .format("delta") .load(outputPathSilver) .groupBy(weekofyear('eventDate').alias('week')) .count() .withColumn("WAU", F.approx_count_distinct('device_id')), but now I am getting a missing-column error: AnalysisException: cannot resolve '`device_id`' given input columns: [count, week];; – sbs Nov 13 '20 at 15:34
  • I still don't understand. You call `count()` after `groupBy()`, but this is wrong: it creates groups per week and then just counts the rows in each group. Remove the `count()` after `groupBy`. – UninformedUser Nov 13 '20 at 18:03
  • Sorry, that was my mistake; I was playing with old code in the function. I have updated the code as follows: spark.readStream .format("delta") .load(outputPathSilver) .groupBy(weekofyear('eventDate').alias('week')) .agg(F.approx_count_distinct('device_id')).alias('WAU') – sbs Nov 13 '20 at 18:25
  • Thanks a lot @UninformedUser – sbs Nov 13 '20 at 18:53
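
To make the point from the comment thread above concrete: `.count()` after `groupBy` counts rows per group, so calling `.distinct()` on that result does nothing useful; the distinct count has to be expressed inside `.agg(...)`, and on a stream it has to be the approximate version. A minimal batch sketch, assuming `F` is `pyspark.sql.functions` and the same `inputpath` as above:

from pyspark.sql import functions as F

# Static (batch) read of the same Delta table, just for illustration
df = spark.read.format("delta").load(inputpath)

# .count() after groupBy = number of rows per week, NOT distinct devices
rows_per_week = df.groupBy(F.weekofyear('eventDate').alias('week')).count()

# The distinct count has to be an aggregation; in batch, countDistinct is fine,
# but on a streaming DataFrame Spark only supports approx_count_distinct
devices_per_week = (df.groupBy(F.weekofyear('eventDate').alias('week'))
                      .agg(F.countDistinct('device_id').alias('WAU')))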

2 Answers


As per the discussion in the comments, the code below worked.

from pyspark.sql import functions as F
from pyspark.sql.functions import weekofyear

(spark.readStream
  .format("delta")
  .load(inputdata)
  .groupBy(weekofyear('eventDate').alias('week'))
  .agg(F.approx_count_distinct('device_id', rsd = .01)).alias('WAU')
  .writeStream
  .format("delta")
  .option("checkpointLocation", outputdata)
  .outputMode("complete")
  .start(outputdata))
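
One thing worth double-checking with this version: `.alias('WAU')` after `.agg(...)` is applied to the DataFrame, not to the aggregated column, so the count column keeps its generated name rather than becoming WAU. A quick sketch for verifying the column names on a static read, assuming `F` is `pyspark.sql.functions` and the same `inputdata` path:

from pyspark.sql import functions as F

# Same aggregation on a batch read, only to inspect the column names it produces
agg_df = (spark.read.format("delta").load(inputdata)
          .groupBy(F.weekofyear('eventDate').alias('week'))
          .agg(F.approx_count_distinct('device_id', rsd=0.01).alias('WAU')))
print(agg_df.columns)  # expect ['week', 'WAU'] once the alias is on the column itself

Moving the alias onto the column inside .agg(...) is exactly the modification the answer below makes.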

Thanks, I struggled through this as well. For me, the code you shared worked, but with a slight modification.

from pyspark.sql.functions import weekofyear, approx_count_distinct

(spark.readStream
      .format("delta")
      .load(inputdata)
      .groupBy(weekofyear('eventDate').alias('week'))
      .agg(approx_count_distinct('device_id', rsd = .01).alias('WAU'))
      .writeStream
      .format("delta")
      .queryName(queryName)
      .option("checkpointLocation", f"{outputdata}_checkpoint")
      .outputMode("complete")
      .start(outputdata))

Note where I have added the alias for approx_count_distinct: it goes on the aggregated column inside .agg(...), so the result column is actually named WAU. :)
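
Once the stream is running, the sink can be read back as an ordinary Delta table to confirm that the week and WAU columns come out as expected. A minimal sketch, assuming the same `outputdata` path and an active `spark` session:

# Batch read of the streaming sink to spot-check the weekly results
spark.read.format("delta").load(outputdata).orderBy("week").show()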