
There are methods like `pyspark.sql.functions.least` or `pyspark.sql.functions.greatest`, but I can't see anything equivalent for row-wise mean/stddev/sum etc.

I thought I could just pivot the DF, but it takes way too much memory: `data.groupby("date").pivot("date").min()`

So I implemented the functions:

import pyspark.sql.functions as f


def null_to_zero(*columns):
    # Replace nulls with 0 so they don't propagate through the arithmetic.
    return [f.when(~f.isnull(c), f.col(c)).otherwise(0) for c in columns]


def row_mean(*columns):
    # Row-wise mean: sum the column values and divide by the number of columns.
    N = len(columns)
    columns = null_to_zero(*columns)
    return sum(columns) / N


def row_stddev(*columns):
    # Row-wise (population) standard deviation across the columns.
    N = len(columns)
    mu = row_mean(*columns)
    return f.sqrt((1 / N) * sum(f.pow(col - mu, 2) for col in null_to_zero(*columns)))


day_stats = data.select(
    f.least(*data.columns[:-1]).alias("min"),
    f.greatest(*data.columns[:-1]).alias("max"),
    row_mean(*data.columns[:-1]).alias("mean"),
    row_stddev(*data.columns[:-1]).alias("stddev"),
    data.columns[-1],
)
day_stats.show()

Sample (mean of each row):

Input DF:

col1|col2
1|2
2|3

Output DF:

mean
1.5
2.5
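
A minimal sketch reproducing this sample with the `row_mean` helper above, assuming an active SparkSession named `spark`:

import pyspark.sql.functions as f

# Hypothetical reproduction of the sample input shown above.
data = spark.createDataFrame([(1.0, 2.0), (2.0, 3.0)], ["col1", "col2"])
data.select(row_mean("col1", "col2").alias("mean")).show()
# +----+
# |mean|
# +----+
# | 1.5|
# | 2.5|
# +----+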

Is there a cleaner way of doing this?

  • Does this answer your question? [How to calculate mean and standard deviation given a PySpark DataFrame?](https://stackoverflow.com/questions/47995188/how-to-calculate-mean-and-standard-deviation-given-a-pyspark-dataframe) – doctorlove Sep 15 '22 at 12:46
  • @doctorlove It seems like that's over columns, not rows. – Axeltherabbit Sep 15 '22 at 12:49
  • There are no built-in functions that calculate row mean, so I think what you've done is acceptable and optimal. – samkart Sep 15 '22 at 13:27
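
To illustrate the distinction raised in the comments, a minimal sketch of the column-wise aggregates from the linked question (one summary row per column, rather than one value per row), assuming the sample DataFrame above with columns `col1` and `col2`:

import pyspark.sql.functions as f

# Column-wise statistics: aggregate down each column across all rows.
data.select(
    f.mean("col1").alias("col1_mean"),
    f.stddev("col1").alias("col1_stddev"),
).show()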

1 Answer


You can try something like this (not sure whether it is better performance-wise, but it is cleaner):

1. Concatenate the required columns into a column of array type.

2. Compute the mean of the array.

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, FloatType

# Collect the columns into an array, then fold the array:
# sum the elements and divide the result by the array size.
sdf.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                          .cast(ArrayType(FloatType()))) \
   .withColumn("mean", f.expr("aggregate(new_col, 0L, (acc, x) -> acc + x, acc -> acc / size(new_col))")) \
   .show(3)

+---+---+--------+----+
|  a|  b| new_col|mean|
+---+---+--------+----+
| 10| 41|[10, 41]|25.5|
|  9| 41| [9, 41]|25.0|
|  9| 41| [9, 41]|25.0|
+---+---+--------+----+
only showing top 3 rows
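
The same array column can also give the other row-wise stats the question mentions; a rough sketch, assuming `sdf` has numeric columns `a` and `b` as above:

import pyspark.sql.functions as f

# Sketch: row-wise min/max/sum over the same array column
# (array_min/array_max need Spark >= 2.4; 0D keeps the accumulator a double).
sdf_with_array = sdf.withColumn("new_col", f.array(f.col("a"), f.col("b")).cast("array<double>"))
(sdf_with_array
    .withColumn("row_min", f.array_min("new_col"))
    .withColumn("row_max", f.array_max("new_col"))
    .withColumn("row_sum", f.expr("aggregate(new_col, 0D, (acc, x) -> acc + x)"))
    .show(3))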
  • Thanks, what is `->`? I've never seen it. – Axeltherabbit Sep 15 '22 at 15:06
  • It's a lambda expression that sums up the array and then divides it by its size. – s510 Sep 15 '22 at 15:59
  • check https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.aggregate.html – s510 Sep 15 '22 at 18:43
  • To me, it only works if I change `0L` with `0F` or `0D`. `L` makes 0 an integer which cannot be aggregated with a float. `F` makes 0 a float, `D` makes 0 a double. Both of these can be aggregated with FloatType. – ZygD Sep 19 '22 at 05:16
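
Building on those comments, a minimal sketch of the same row mean via the Python-level `pyspark.sql.functions.aggregate` API from the linked docs (Spark >= 3.1), using a float initial value so the accumulator type matches:

import pyspark.sql.functions as f

# f.lit(0.0) makes the accumulator a double, avoiding the bigint/float
# mismatch that 0L causes in the SQL expression above.
(
    sdf
    .withColumn("new_col", f.array(f.col("a"), f.col("b")).cast("array<double>"))
    .withColumn(
        "mean",
        f.aggregate(
            f.col("new_col"),
            f.lit(0.0),
            lambda acc, x: acc + x,
            lambda acc: acc / f.size("new_col"),
        ),
    )
    .show(3)
)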