
I have a PySpark DataFrame (not pandas) called df that is too large to use collect(). Therefore the code below is not efficient. It worked with a smaller amount of data, but now it fails.

import numpy as np

myList = df.collect()
total = []
for product,nb in myList:
    for p2,score in nb:
            total.append(score)
mean = np.mean(total)
std = np.std(total)

Is there any way to get mean and std as two variables by using pyspark.sql.functions or similar?

from pyspark.sql.functions import mean as mean_, std as std_

I could use withColumn; however, that approach applies the calculation row by row and does not return a single variable.

UPDATE:

Sample content of df:

+----------+------------------+
|product_PK|          products|
+----------+------------------+
|       680|[[691,1], [692,5]]|
|       685|[[691,2], [692,2]]|
|       684|[[691,1], [692,3]]|
+----------+------------------+
I need to calculate the mean and standard deviation of the score values, e.g. the value 1 in [691,1] is one of the scores.

Markus

4 Answers


You can use the built in functions to get aggregate statistics. Here's how to get mean and standard deviation.

from pyspark.sql.functions import mean as _mean, stddev as _stddev, col

df_stats = df.select(
    _mean(col('columnName')).alias('mean'),
    _stddev(col('columnName')).alias('std')
).collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

Note that there are three different standard deviation functions. From the docs the one I used (stddev) returns the following:

Aggregate function: returns the unbiased sample standard deviation of the expression in a group
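
For reference, here is a minimal sketch comparing the three variants on a placeholder numeric column 'columnName' (stddev is an alias for the sample version, which divides by n-1; stddev_pop divides by n):

from pyspark.sql.functions import stddev, stddev_samp, stddev_pop, col

# stddev is an alias for stddev_samp (sample standard deviation);
# stddev_pop is the population standard deviation
df.select(
    stddev(col('columnName')).alias('stddev'),
    stddev_samp(col('columnName')).alias('stddev_samp'),
    stddev_pop(col('columnName')).alias('stddev_pop')
).show()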

You could use the describe() method as well:

df.describe().show()
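
If you go the describe() route, one way to turn its output back into Python variables is sketched below; it assumes the stats-label column is named 'summary' and the values come back as strings, which is the behavior I'm aware of:

# rough sketch: collect describe() output into a dict keyed by the stat name
stats = {row['summary']: row['columnName'] for row in df.describe('columnName').collect()}
mean = float(stats['mean'])
std = float(stats['stddev'])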

Refer to this link for more info: pyspark.sql.functions

UPDATE: This is how you can work through the nested data.

Use explode to extract the values into separate rows, then call mean and stddev as shown above.

Here's a MWE:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import explode, col, udf, mean as _mean, stddev as _stddev

# mock up sample dataframe
df = sqlCtx.createDataFrame(
    [(680, [[691,1], [692,5]]), (685, [[691,2], [692,2]]), (684, [[691,1], [692,3]])],
    ["product_PK", "products"]
)

# udf to get the "score" value - returns the item at index 1
get_score = udf(lambda x: x[1], IntegerType())

# explode column and get stats
df_stats = df.withColumn('exploded', explode(col('products')))\
    .withColumn('score', get_score(col('exploded')))\
    .select(
        _mean(col('score')).alias('mean'),
        _stddev(col('score')).alias('std')
    )\
    .collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

print([mean, std])

Which outputs:

[2.3333333333333335, 1.505545305418162]

You can verify that these values are correct using numpy:

vals = [1,5,2,2,1,3]
print([np.mean(vals), np.std(vals, ddof=1)])

Explanation: Your "products" column is a list of lists. Calling explode will make a new row for each element of the outer list. Then grab the "score" value from each of the exploded rows, which you have defined as the second element in a 2-element list. Finally, call the aggregate functions on this new column.
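
As a side note, if your Spark version lets you index into the inner element by position, the udf could likely be replaced with Column.getItem; a sketch of the same MWE under that assumption:

# hypothetical variant of the MWE above: index the inner 2-element array directly
df_stats = df.withColumn('exploded', explode(col('products')))\
    .withColumn('score', col('exploded').getItem(1))\
    .select(
        _mean(col('score')).alias('mean'),
        _stddev(col('score')).alias('std')
    )\
    .collect()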

pault
  • Nice, thanks. However, one of the difficulties is the nested data that I have. Could you please check my update? – Markus Dec 27 '17 at 16:40
  • I get an error `UnboundLocalError: local variable 'mean' referenced before assignment`. – Markus Dec 28 '17 at 09:19
  • To solve this error, it's necessary to update the import statement as follows: `from pyspark.sql.functions import mean as _mean, stddev as _stddev` – Markus Dec 28 '17 at 09:38
  • @Markus thanks for pointing that out. I've updated the code accordingly. – pault Dec 28 '17 at 14:50
  • This worked out of the box with PySpark on Hive 2.3.6 and Spark 2.4.5. Thanks! – Pablo Adames Aug 04 '20 at 01:03

You can use mean and stddev from pyspark.sql.functions:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(680, [[691,1], [692,5]]), (685, [[691,2], [692,2]]), (684, [[691,1], [692,3]])],
    ["product_PK", "products"]
)

result_df = (
    df
    .withColumn(
        'val_list',
        F.array(df.products.getItem(0).getItem(1), df.products.getItem(1).getItem(1))
    )
    .select(F.explode('val_list').alias('val'))
    .select(F.mean('val').alias('mean'), F.stddev('val').alias('stddev'))
)

print(result_df.collect())

which outputs:

[Row(mean=2.3333333333333335, stddev=1.505545305418162)]

You can read more about pyspark.sql.functions in the documentation.

Mahdi

For standard deviation, a better way of writing it is shown below. We can format the result to 2 decimal places and give the column an alias:

from pyspark.sql import SparkSession
from pyspark.sql.functions import format_number, stddev

spark = SparkSession.builder.appName('Sales_fun').getOrCreate()
data = spark.read.csv('sales_info.csv', inferSchema=True, header=True)

data.select(format_number(stddev('Sales'), 2).alias('Sales_Stdev')).show()
BigData-Guru

If you just want the mean and standard deviation of a column, the simplest way I can think of is to use the agg function.

Get the Mean of a column

df.agg({'produ': 'mean'}).show()

# or you can also use 
data.agg({'balance': 'avg'}).show()

Get the Standard Deviation of a column

data.agg({'balance': 'stddev'}).show()

# and for variance you can use
data.agg({'balance': 'variance'}).show()
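
Since the question asks for the result as a variable rather than a printed table, you can also collect the single aggregate row and index into it; a small sketch (the column name 'balance' is just the example above):

# collect the one-row aggregate result and grab the value by position
row = data.agg({'balance': 'stddev'}).collect()[0]
balance_std = row[0]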
Darkstar Dream