In PySpark, I'm trying to count the occurrences of each unique user ID in JSON logs (the dataset is a JSON file).
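For context, here is a minimal, self-contained sketch of the shape of the data; the createDataFrame call and the sample values are just stand-ins for the real JSON read, so everything except the nested/user_id naming is illustrative:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the real spark.read.json(...) call: a struct column "nested"
# containing a "user_id" field, as in the actual logs.
df = spark.createDataFrame(
    [(("u1",),), (("u1",),), (("u2",),)],
    "nested struct<user_id: string>",
)
df.printSchema()
# root
#  |-- nested: struct (nullable = true)
#  |    |-- user_id: string (nullable = true)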
The following works:
df.select(
    F.col("nested.user_id")
)\
.where(
    ...
)\
.groupBy(
    F.col("user_id")
)\
.count()
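A quick check against the toy df above shows what the selected column is actually called, if I am reading this right:

print(df.select(F.col("nested.user_id")).columns)
# ['user_id']  <- the struct prefix is gone from the column name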
Notice that the "nested." prefix does not appear in the groupBy clause: Spark seems to remove it automatically when the field is selected. I need this prefix to appear in the result, so I tried the following query:
df.select(
    F.col("nested.user_id").alias("nested.user_id")
)\
.where(
    ...
)\
.groupBy(
    F.col("nested.user_id")
)\
.count()
The alias seems to be applied, but the groupBy does not recognize the aliased name:
org.apache.spark.sql.AnalysisException: cannot resolve '`nested.user_id`' given input columns: [nested.user_id];
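For what it's worth, the select step really does seem to produce a single column literally named nested.user_id; checking the intermediate DataFrame (same toy df as above):

selected = df.select(F.col("nested.user_id").alias("nested.user_id"))
print(selected.columns)
# ['nested.user_id']
# Grouping this with F.col("nested.user_id") then raises the AnalysisException quoted above.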
Any idea? Thanks