
I have the following code with some aggregation functions:

new_df = my_df.groupBy('id').agg({"id": "count", "money": "max"})

The new columns I get are called COUNT(id) and MAX(money). Can I specify the column names myself instead of using the default ones? E.g. I want them to be called my_count_id and my_max_money. How do I do that? Thanks!

Edamame

2 Answers


Use Column expressions instead of a dict:

>>> from pyspark.sql.functions import *
>>> my_df.groupBy('id').agg(count("id").alias("some name"), max("money").alias("some other name"))
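
Note that a wildcard import shadows Python's built-in max (and min, sum), so importing the functions module under a short name is usually safer. A minimal, self-contained sketch with hypothetical sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with the same schema as in the question.
my_df = spark.createDataFrame([(1, 10.0), (1, 25.0), (2, 5.0)], ["id", "money"])

new_df = my_df.groupBy("id").agg(
    F.count("id").alias("my_count_id"),    # instead of COUNT(id)
    F.max("money").alias("my_max_money"),  # instead of MAX(money)
)
new_df.show()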

Maybe something like:

new_df = my_df.groupBy('id') \
    .agg({"id": "count", "money": "max"}) \
    .withColumnRenamed("COUNT(id)", "my_count_id") \
    .withColumnRenamed("MAX(money)", "my_max_money")

or:

import pyspark.sql.functions as func

new_df = my_df.groupBy('id') \
    .agg(func.count("id").alias("my_count_id"),
         func.max("money").alias("my_max_money"))
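
To see the difference, you can inspect the generated column names, assuming the same hypothetical my_df as in the first answer (the exact casing of the auto-generated names can vary across Spark versions, so it is safer to check df.columns than to hard-code them):

import pyspark.sql.functions as func

renamed = my_df.groupBy("id").agg({"id": "count", "money": "max"})
print(renamed.columns)   # e.g. ['id', 'count(id)', 'max(money)'], casing depends on the Spark version

aliased = my_df.groupBy("id").agg(
    func.count("id").alias("my_count_id"),
    func.max("money").alias("my_max_money"),
)
print(aliased.columns)   # ['id', 'my_count_id', 'my_max_money']
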
neocortex
  • Assuming one aggregate function, say func.sum, is there an efficient way to groupby and alias when there is, say, 1k columns? My current workaround: `X = df.columns[1:]; new_cols = [df.columns[0]] + [x+'_summed' for x in X]; exprs = {x: "sum" for x in X}; dg = df.groupBy("col1").agg(exprs).toDF(*new_cols)` – Quetzalcoatl Apr 13 '18 at 19:30
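
For the many-columns case in the comment above, one alternative (a sketch along the same lines, not benchmarked) is to build a list of aliased Column expressions and unpack it into a single agg call, so the renaming happens in one pass without toDF:

import pyspark.sql.functions as func

# Assuming, as in the comment, that the first column is the grouping key
# and every remaining column should be summed.
value_cols = df.columns[1:]
exprs = [func.sum(c).alias(c + "_summed") for c in value_cols]
dg = df.groupBy(df.columns[0]).agg(*exprs)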