Please note that you can run this locally only if you already have Spark installed; in that case, create a session with the following commands. Otherwise, replicate the issue on a Databricks cluster, which initializes a Spark context automatically.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Dataframe
import pandas as pd

pandas_dataframe = pd.DataFrame(
    {'id': ['001', '001', '001', '001', '001', '002', '002', '002'],
     'OuterSensorConnected': [0, 0, 0, 1, 0, 0, 0, 1],
     'OuterHumidity': [31.784826, 32.784826, 33.784826, 43.784826, 23.784826, 54.784826, 31.784826, 31.784826],
     'EnergyConsumption': [70, 70, 70, 70, 70, 70, 70, 70],
     'DaysDeploymentDate': [10, 20, 21, 31, 41, 11, 19, 57]}
)
spark_dataframe = spark.createDataFrame(pandas_dataframe)
My issue
I group the data by id, and I want the aggregations that are applied to live in a function, because the same aggregations are used in many different applications, so this keeps the code modular. The aggregation function:
import pyspark.sql.functions as sql_function
def data_aggregation():
sql_function.mean("OuterSensorConnected").alias("OuterSensorConnected"),
sql_function.mean("OuterHumidity").alias("averageOuterHumidity"),
sql_function.mean("EnergyConsumption").alias("EnergyConsumption"),
sql_function.max("DaysDeploymentDate").alias("maxDaysDeploymentDate")
spark_dataframe = spark_dataframe.groupBy("id")\
    .agg(data_aggregation())
Executing the above code gives me the following error:
"AssertionError: all exprs should be Column"
PS: I know that I could wrap the whole groupBy statement in a function. However, that part changes from one application to another, so I wanted to modularize only the fixed part, which is the columns to aggregate. Is there a way to fix this? If not, I can live with it, but I want to know why it fails.
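To make it concrete, here is a minimal sketch of the kind of modular helper I am aiming for. I am assuming that returning the expressions as a list and unpacking them with * at the call site would give agg the Column objects it expects; the name reusable_aggregations is just a placeholder for this sketch.

import pyspark.sql.functions as sql_function

def reusable_aggregations():
    # Return the shared aggregation expressions as a list of Column objects,
    # so any application can reuse them inside its own groupBy.
    return [
        sql_function.mean("OuterSensorConnected").alias("OuterSensorConnected"),
        sql_function.mean("OuterHumidity").alias("averageOuterHumidity"),
        sql_function.mean("EnergyConsumption").alias("EnergyConsumption"),
        sql_function.max("DaysDeploymentDate").alias("maxDaysDeploymentDate"),
    ]

# Unpack the list so agg() receives each Column as a separate positional argument.
aggregated = spark_dataframe.groupBy("id").agg(*reusable_aggregations())
aggregated.show()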