Suppose that we have a CSV file which has been imported as a dataframe in PySpark as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("file path and name.csv", inferSchema = True, header = True)
df.show()
Output:
+-----+----+----+
|lable|year| val|
+-----+----+----+
|    A|2003| 5.0|
|    A|2003| 6.0|
|    A|2003| 3.0|
|    A|2004|null|
|    B|2000| 2.0|
|    B|2000|null|
|    B|2009| 1.0|
|    B|2000| 6.0|
|    B|2009| 6.0|
+-----+----+----+
Now, we want to add another column to df which contains the standard deviation of val, computed within each group defined by the two columns lable and year. So, the output must be as follows:
+-----+----+----+-----+
|lable|year| val|  std|
+-----+----+----+-----+
|    A|2003| 5.0| 1.53|
|    A|2003| 6.0| 1.53|
|    A|2003| 3.0| 1.53|
|    A|2004|null| null|
|    B|2000| 2.0| 2.83|
|    B|2000|null| 2.83|
|    B|2009| 1.0| 3.54|
|    B|2000| 6.0| 2.83|
|    B|2009| 6.0| 3.54|
+-----+----+----+-----+
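For reference, the std values above can be reproduced in plain Python. This is only a sanity check on the small example, assuming the sample standard deviation (n - 1 denominator, which is what f.stddev computes as an alias of stddev_samp) with nulls skipped. Returning None for a group with fewer than two non-null values is an assumption made here to match the table.

```python
from collections import defaultdict
from statistics import stdev

# The example rows from the table above; None stands in for null.
rows = [
    ("A", 2003, 5.0), ("A", 2003, 6.0), ("A", 2003, 3.0), ("A", 2004, None),
    ("B", 2000, 2.0), ("B", 2000, None), ("B", 2009, 1.0), ("B", 2000, 6.0),
    ("B", 2009, 6.0),
]

# Collect the non-null values of each (lable, year) group.
groups = defaultdict(list)
for lable, year, val in rows:
    values = groups[(lable, year)]  # creates the group even if val is null
    if val is not None:
        values.append(val)

# Sample standard deviation per group, rounded to 2 decimals.
# Groups with fewer than two non-null values get None (assumption).
expected = {
    key: round(stdev(values), 2) if len(values) >= 2 else None
    for key, values in groups.items()
}

for key in sorted(expected):
    print(key, expected[key])
```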
I have the following code, which works for a small dataframe but does not work for the very large dataframe (about 40 million rows) that I am working with now.
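import pyspark.sql.functions as f
# Standard deviation of val per (lable, year) group, rounded to 2 decimals
a = df.groupby('lable', 'year').agg(f.round(f.stddev('val'), 2).alias('std'))
# Attach the per-group value back onto every row of the original dataframe
df = df.join(a, on=['lable', 'year'], how='inner')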
When I run this on my large dataframe, I get the following error:
Py4JJavaError Traceback (most recent call last)
Does anyone know an alternative way to do this? I hope it also works on my large dataset.
I am using Python 3.7.1, PySpark 2.4, and Jupyter 4.4.0.