
In PySpark, I'm trying to count the unique occurrences of a user ID in JSON logs (the dataset is a JSON file).

The following works:

from pyspark.sql import functions as F

df.select(
        F.col("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("user_id")
    )\
    .count()

Notice that the "nested." prefix does not appear in the groupBy clause; it seems to be automatically removed by Spark. I need this prefix to appear, so I tried the following query:

df.select(
        F.col("nested.user_id").alias("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("nested.user_id")
    )\
    .count()

The alias seems to work but the groupBy does not know about it:

org.apache.spark.sql.AnalysisException: cannot resolve '`nested.user_id`' given input columns: [nested.user_id];
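As a quick sanity check (a minimal sketch against the same df, not the full query), inspecting the column names shows that the alias really does produce a flat column whose name contains a dot:

    from pyspark.sql import functions as F

    df.select(F.col("nested.user_id")).columns
    # -> ['user_id']           the prefix is stripped

    df.select(F.col("nested.user_id").alias("nested.user_id")).columns
    # -> ['nested.user_id']    the alias keeps the dot in the name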

Any idea? Thanks

Raphael
  • try wrapping the alias in backticks: `` .alias("`nested.user_id`") `` – pault Jun 11 '19 at 22:29
  • Thank you @pault! `` F.col("`nested.user_id`") `` worked :) – Raphael Jun 12 '19 at 08:35
  • Possible duplicate of [Column name with dot spark](https://stackoverflow.com/questions/44367019/column-name-with-dot-spark), [DataFrame columns names conflict with .(dot)](https://stackoverflow.com/questions/49032451/dataframe-columns-names-conflict-with-dot), [pyspark access column of dataframe with a dot '.'](https://stackoverflow.com/questions/37251653/pyspark-access-column-of-dataframe-with-a-dot) – pault Jun 12 '19 at 18:11

1 Answer


Thanks to @pault's comment, here is the answer:

df.select(
        F.col("nested.user_id").alias("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("`nested.user_id`")
    )\
    .count()

Backticks have been added around the column name in the groupBy clause, so the dot is treated as part of the name rather than as a nested-field accessor.
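For completeness, here is a self-contained sketch of the same pattern. The schema and sample values below are made up for illustration; only the select/groupBy part mirrors the real query:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Toy data with the same nested shape as the logs (illustrative only)
    schema = StructType([
        StructField("nested", StructType([
            StructField("user_id", StringType())
        ]))
    ])
    df = spark.createDataFrame([(("u1",),), (("u1",),), (("u2",),)], schema)

    counts = (
        df.select(F.col("nested.user_id").alias("nested.user_id"))
          # Backticks quote the identifier, so the dot is read as part of
          # the column name rather than as a struct field access.
          .groupBy(F.col("`nested.user_id`"))
          .count()
    )
    counts.show()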

Raphael