There are row-wise methods like pyspark.sql.functions.least and pyspark.sql.functions.greatest, but I can't see anything equivalent for mean/stddev/sum etc.
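For context, a minimal sketch of how the built-in row-wise functions behave (the DataFrame and column names here are purely illustrative):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Toy data: one row per observation, several value columns.
df = spark.createDataFrame([(1, 5, 3), (4, 2, 6)], ["a", "b", "c"])

# least/greatest aggregate across columns within each row.
df.select(
    f.least("a", "b", "c").alias("row_min"),
    f.greatest("a", "b", "c").alias("row_max"),
).show()
# row_min, row_max -> (1, 5) and (2, 6)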
I thought I could just pivot the DF but it takes way too much memory:
data.groupby("date").pivot("date").min()
So I implemented the functions:
from pyspark.sql import functions as f

def null_to_zero(*columns):
    # Replace nulls with 0 so the row-wise sums don't propagate nulls.
    return [f.when(~f.isnull(c), f.col(c)).otherwise(0) for c in columns]

def row_mean(*columns):
    N = len(columns)
    columns = null_to_zero(*columns)
    return sum(columns) / N

def row_stddev(*columns):
    # Population standard deviation across the row's columns.
    N = len(columns)
    mu = row_mean(*columns)
    return f.sqrt((1 / N) * sum(f.pow(col - mu, 2) for col in null_to_zero(*columns)))
day_stats = data.select(
    f.least(*data.columns[:-1]).alias("min"),
    f.greatest(*data.columns[:-1]).alias("max"),
    row_mean(*data.columns[:-1]).alias("mean"),
    row_stddev(*data.columns[:-1]).alias("stddev"),
    data.columns[-1],
)
day_stats.show()
Sample (mean of each row):

Input DF:

col1 | col2
   1 |    2
   2 |    3

Output DF:

mean
 1.5
 2.5
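For reference, a self-contained sketch that reproduces the sample above (assumes a local SparkSession; it inlines column arithmetic equivalent to the row_mean helper defined earlier):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# The two-column sample input from above.
df = spark.createDataFrame([(1, 2), (2, 3)], ["col1", "col2"])

# Row-wise mean built from plain column arithmetic.
df.select(((f.col("col1") + f.col("col2")) / 2).alias("mean")).show()
# mean -> 1.5 and 2.5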
Is there a cleaner way of doing this?