
In PySpark, I'm trying to count the unique occurrences of a user ID in JSON logs (the dataset is a JSON file).

The following works:

from pyspark.sql import functions as F

df.select(
        F.col("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("user_id")
    )\
    .count()

Notice that the "nested." prefix does not appear in the groupBy clause; it seems to be automatically removed by Spark. I need this prefix to appear, so I tried the following query:

df.select(
        F.col("nested.user_id").alias("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("nested.user_id")
    )\
    .count()

The alias seems to work but the groupBy does not know about it:

org.apache.spark.sql.AnalysisException: cannot resolve '`nested.user_id`' given input columns: [nested.user_id];
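As a quick sanity check (a minimal sketch against the same df, not the full query), inspecting the column names shows that the alias really does produce a flat column whose name contains a dot:

    from pyspark.sql import functions as F

    df.select(F.col("nested.user_id")).columns
    # -> ['user_id']           the prefix is stripped

    df.select(F.col("nested.user_id").alias("nested.user_id")).columns
    # -> ['nested.user_id']    the alias keeps the dot in the name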

Any idea? Thanks

Raphael
  • try wrapping the alias in backticks: `` .alias("`nested.user_id`") `` – pault Jun 11 '19 at 22:29
  • Thank you @pault! `` F.col("`nested.user_id`") `` worked :) – Raphael Jun 12 '19 at 08:35
  • Possible duplicate of [Column name with dot spark](https://stackoverflow.com/questions/44367019/column-name-with-dot-spark), [DataFrame columns names conflict with .(dot)](https://stackoverflow.com/questions/49032451/dataframe-columns-names-conflict-with-dot), [pyspark access column of dataframe with a dot '.'](https://stackoverflow.com/questions/37251653/pyspark-access-column-of-dataframe-with-a-dot) – pault Jun 12 '19 at 18:11

1 Answer


Thanks to @pault's comment, here is the answer:

df.select(
        F.col("nested.user_id").alias("nested.user_id")
    )\
    .where(
        ...
    )\
    .groupBy(
        F.col("`nested.user_id`")
    )\
    .count()

Backticks have been added around the column name in the groupBy clause, so the dot is treated as part of the name rather than as a nested-field accessor.
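For completeness, here is a self-contained sketch of the same pattern. The schema and sample values below are made up for illustration; only the select/groupBy part mirrors the real query:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Toy data with the same nested shape as the logs (illustrative only)
    schema = StructType([
        StructField("nested", StructType([
            StructField("user_id", StringType())
        ]))
    ])
    df = spark.createDataFrame([(("u1",),), (("u1",),), (("u2",),)], schema)

    counts = (
        df.select(F.col("nested.user_id").alias("nested.user_id"))
          # Backticks quote the identifier, so the dot is read as part of
          # the column name rather than as a struct field access.
          .groupBy(F.col("`nested.user_id`"))
          .count()
    )
    counts.show()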

Raphael