
Please note that you can run this locally only if you already have Spark installed (e.g. via pip install pyspark); then create a session with the following commands. Otherwise, replicate the issue on a Databricks cluster, which initializes a Spark context automatically.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

sc = spark.sparkContext

Dataframe:

spark_dataframe = pd.DataFrame(
    {'id': ['001', '001', '001', '001', '001', '002', '002', '002'],
     'OuterSensorConnected': [0, 0, 0, 1, 0, 0, 0, 1],
     'OuterHumidity': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826, 31.784826],
     'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70, 70],
     'DaysDeploymentDate': [10, 20, 21, 31, 41, 11, 19, 57]}
)
spark_dataframe = spark.createDataFrame(spark_dataframe)

My issue:

I group the data by id, and I want the aggregations that are applied to live in a function, because the same aggregations are used in many different applications; factoring them out keeps the code modular. The aggregation function:

import pyspark.sql.functions as sql_function

def data_aggregation():
    sql_function.mean("OuterSensorConnected").alias("OuterSensorConnected"),
    sql_function.mean("OuterHumidity").alias("averageOuterHumidity"),
    sql_function.mean("EnergyConsumption").alias("EnergyConsumption"),
    sql_function.max("DaysDeploymentDate").alias("maxDaysDeploymentDate")

spark_dataframe = spark_dataframe.groupBy("id")\
                                 .agg(data_aggregation())

Executing the above code gives me the following error:

"AssertionError: all exprs should be Column"


PS: I know that I could make the whole groupBy statement a function. However, that part changes from one application to another, so I wanted to modularize only the fixed part, which is the set of columns to aggregate. Is there a way to fix this? If not, I can live with it, but I want to know why.


1 Answer


You have to fix data_aggregation() so that it actually returns the column expressions. As written, the function builds a tuple of Columns and discards it, so it implicitly returns None; agg() then receives None instead of Columns, which triggers the assertion "all exprs should be Column". Return the expressions as a list and unpack them with *:

def data_aggregation():
    expr = [sql_function.mean("OuterSensorConnected").alias("OuterSensorConnected"),
            sql_function.mean("OuterHumidity").alias("averageOuterHumidity"),
            sql_function.mean("EnergyConsumption").alias("EnergyConsumption"),
            sql_function.max("DaysDeploymentDate").alias("maxDaysDeploymentDate")]
    return expr

spark_dataframe = spark_dataframe.groupBy("id")\
                                 .agg(*data_aggregation())
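
The * is essential here: agg() takes Columns as separate positional arguments, so the returned list has to be unpacked. Passing the list itself as a single argument would fail the same assertion, since a list is not a Column.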

Refer to this answer for a better way to modularize your function calls: pyspark: groupby and aggregate avg and first on multiple columns
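
For instance, here is a minimal sketch of that dictionary-driven approach (the agg_spec mapping and the build_aggregations helper are illustrative names, not taken from the linked answer):

import pyspark.sql.functions as sql_function

# Map each column to the name of the pyspark.sql.functions aggregate to apply.
agg_spec = {
    "OuterSensorConnected": "mean",
    "OuterHumidity": "mean",
    "EnergyConsumption": "mean",
    "DaysDeploymentDate": "max",
}

def build_aggregations(spec):
    # Look up each aggregate by name and alias the result to the column name.
    # (The original post's prefixed aliases, e.g. "averageOuterHumidity",
    # would need an extra mapping; plain column names keep the sketch short.)
    return [getattr(sql_function, fn)(col).alias(col) for col, fn in spec.items()]

result = spark_dataframe.groupBy("id").agg(*build_aggregations(agg_spec))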

If you want to apply a fully custom function to grouped data, search for pandas user-defined aggregate functions.
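
If you are on Spark 3.0 or later, one concrete option is GroupedData.applyInPandas, which hands each group to a plain pandas function. A minimal sketch (the summarize function body is an illustrative assumption, not from the original post):

import pandas as pd

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one group as a pandas DataFrame; must return a pandas
    # DataFrame matching the schema declared below.
    return pd.DataFrame({
        "id": [pdf["id"].iloc[0]],
        "averageOuterHumidity": [pdf["OuterHumidity"].mean()],
        "maxDaysDeploymentDate": [pdf["DaysDeploymentDate"].max()],
    })

result = spark_dataframe.groupBy("id").applyInPandas(
    summarize,
    schema="id string, averageOuterHumidity double, maxDaysDeploymentDate long",
)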
