2

I am trying to find the mean of a column in Pandas. The rows from which the mean will be calculated, need to be based on duplicate values in another column. In the example below, I want to populate the empty "mean_humidity" column with the mean calculated from the "humidity" column for each individual date based on the "date" column. For example, the first 3 rows in the "mean_humidity" column will be the mean calculated from the first 3 rows in the "humidity" column. Thanks!

>>> import pandas as pd
>>> dates = ['1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020', '1/2/2020']
>>> humidity = [11, 22, 33, 44, 55, 66]
>>> df = pd.DataFrame(list(zip(dates, humidity)),
...                columns =['dates', 'humidity'])
>>> df["mean_humidity"] = ""
>>> df
      dates  humidity mean_humidity
0  1/1/2020        11
1  1/1/2020        22
2  1/1/2020        33
3  1/2/2020        44
4  1/2/2020        55
5  1/2/2020        66
Starlink
  • 55
  • 4

2 Answers2

2

by group by :

df["mean_humidity"] = df.groupby('dates')['humidity'].transform('mean')
print(df)

output:

>>>
      dates  humidity  mean_humidity
0  1/1/2020        11           22.0
1  1/1/2020        22           22.0
2  1/1/2020        33           22.0
3  1/2/2020        44           55.0
4  1/2/2020        55           55.0
5  1/2/2020        66           55.0
eshirvana
  • 23,227
  • 3
  • 22
  • 38
1

You could use:

df["mean_humidity"] = df["humidity"].groupby(df["dates"]).transform("mean")

Output:

    dates       humidity    mean_humidity
0   1/1/2020    11          22.0
1   1/1/2020    22          22.0
2   1/1/2020    33          22.0
3   1/2/2020    44          55.0
4   1/2/2020    55          55.0
5   1/2/2020    66          55.0
Marco_CH
  • 3,243
  • 8
  • 25