
I am new to PySpark, so I would be grateful if anyone could help me fix this problem.

Suppose that I have a dataframe in pyspark as follows:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   A|2001|   2|   5|   6|
|   A|2001|   2|   4|   8|
|   A|2001|   3|   6|  10|
|   A|2002|   4|   5|   2|
|   B|2001|   2|   9|   4|
|   B|2001|   2|   4|   3|
|   B|2001|   2|   3|   4|
|   B|2001|   3|  95|   7|
+----+----+----+----+----+

I want to compute the mean of col4 (and col5) over rows whose values in col1, col2, and col3 are identical, and then drop the rows that are duplicated in those first three columns.

For example, the values of col1, col2, and col3 in the first two rows are the same, so we want to keep only one of them and replace col4 and col5 with their respective means. The result should be:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   A|2001|   2| 4.5|   7|
|   A|2001|   3|   6|  10|
|   A|2002|   4|   5|   2|
|   B|2001|   2|5.33|3.67|
|   B|2001|   3|  95|   7|
+----+----+----+----+----+

A similar question has been asked before, but for a pandas DataFrame; this question is about a PySpark DataFrame.

bruno
Monirrad
    IIUC `df.groupby("col1", "col2", "col3").agg(f.mean("col4"), f.mean("col5"))` where `f` is defined by `import pyspark.sql.functions as f` – pault Jan 10 '19 at 19:22
  • Possible duplicate of [Pyspark:How to calculate avg and count in a single groupBy?](https://stackoverflow.com/questions/51632126/pysparkhow-to-calculate-avg-and-count-in-a-single-groupby) – pault Jan 10 '19 at 19:23
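For readers without a Spark session at hand, the logic of the one-liner suggested in the comment can be sanity-checked in plain Python (the rows below are taken from the answer's example DataFrame):

```python
from collections import defaultdict
from statistics import mean

# Rows (col1, col2, col3, col4, col5) from the answer's example DataFrame.
rows = [('A', 2001, 2, 5, 6), ('A', 2001, 2, 4, 8), ('A', 2001, 3, 6, 10),
        ('A', 2002, 4, 5, 2), ('B', 2001, 2, 9, 4), ('B', 2001, 2, 4, 3),
        ('B', 2001, 2, 3, 4), ('B', 2001, 3, 95, 7)]

# Group by (col1, col2, col3) and average col4 and col5 per group,
# mirroring groupby(...).agg(mean(...)) in Spark.
groups = defaultdict(list)
for c1, c2, c3, c4, c5 in rows:
    groups[(c1, c2, c3)].append((c4, c5))

result = {key: (mean(p[0] for p in pairs), mean(p[1] for p in pairs))
          for key, pairs in groups.items()}
```

This confirms, for instance, that the (A, 2001, 2) group averages to (4.5, 7), matching the expected output.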

1 Answer


Step 1: Creating the said DataFrame -

from pyspark.sql import SparkSession

# sqlContext is the pre-Spark-2.0 entry point; a SparkSession works the same way.
spark = SparkSession.builder.getOrCreate()

values = [('A', 2001, 2, 5, 6), ('A', 2001, 2, 4, 8), ('A', 2001, 3, 6, 10),
          ('A', 2002, 4, 5, 2), ('B', 2001, 2, 9, 4), ('B', 2001, 2, 4, 3),
          ('B', 2001, 2, 3, 4), ('B', 2001, 3, 95, 7)]
df = spark.createDataFrame(values, ['col1', 'col2', 'col3', 'col4', 'col5'])
df.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   A|2001|   2|   5|   6|
|   A|2001|   2|   4|   8|
|   A|2001|   3|   6|  10|
|   A|2002|   4|   5|   2|
|   B|2001|   2|   9|   4|
|   B|2001|   2|   4|   3|
|   B|2001|   2|   3|   4|
|   B|2001|   3|  95|   7|
+----+----+----+----+----+

Step 2: Aggregating columns col4 and col5 with their group means.

from pyspark.sql.functions import avg

df = df.groupby(['col1', 'col2', 'col3']).agg(avg('col4').alias('col4'),
                                              avg('col5').alias('col5'))
df.show()
+----+----+----+-----------------+------------------+
|col1|col2|col3|             col4|              col5|
+----+----+----+-----------------+------------------+
|   A|2001|   3|              6.0|              10.0|
|   A|2002|   4|              5.0|               2.0|
|   B|2001|   2|5.333333333333333|3.6666666666666665|
|   A|2001|   2|              4.5|               7.0|
|   B|2001|   3|             95.0|               7.0|
+----+----+----+-----------------+------------------+
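Note that the question's expected table shows col4 and col5 rounded to two decimals (5.33 and 3.67 for the (B, 2001, 2) group); the exact means above round to exactly those figures, as a quick plain-Python check confirms. In Spark you could apply `pyspark.sql.functions.round` to the aggregated columns for the same effect.

```python
# Exact group means for the (B, 2001, 2) group, as computed by avg() above.
col4_mean = (9 + 4 + 3) / 3   # 5.333333333333333
col5_mean = (4 + 3 + 4) / 3   # 3.6666666666666665

rounded = (round(col4_mean, 2), round(col5_mean, 2))
```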
cph_sto