I am new to PySpark, so I would be grateful if anyone could help me fix this problem.
Suppose I have a PySpark dataframe as follows:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A|2001| 2| 5| 6|
| A|2001| 3| 6| 10|
| A|2001| 3| 6| 10|
| A|2002| 4| 5| 2|
| B|2001| 2| 9| 4|
| B|2001| 2| 4| 3|
| B|2001| 2| 3| 4|
| B|2001| 3| 95| 7|
+----+----+----+----+----+
I want to take the mean of col4 (and col5) over the rows whose values in col1, col2, and col3 are identical, and then drop the duplicate rows so that only one row per group remains. For example, rows 2 and 3 have the same values in col1, col2, and col3, so one of them should be eliminated and col4 and col5 updated with the means over those two rows. The result should be:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A|2001| 2| 5| 6|
| A|2001| 3| 6| 10|
| A|2002| 4| 5| 2|
| B|2001| 2|5.33|3.67|
| B|2001| 3| 95| 7|
+----+----+----+----+----+
A similar question has been asked before, but for a pandas dataframe; this question is about a PySpark dataframe.