2

I have a pandas dataframe that consists of multiple columns. For this question we only need to focus on the "Cholesterol" and "Age" column.

In this dataframe each row represents a person and they all have an age. The Cholesterol column has a lot of NaN values and I'm trying to fill them in by taking the mean Cholesterol for their age

E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a 37 year old person

Even better probably would be getting the mean value of their age range.

E.g. A row with age 37 and Cholesterol NaN how would I replace that with the mean Cholesterol of a person between the age of 30-40

As I do not have a lot of values for each specific age

  • Could you provide an example of your data? Maybe doing df.head().to_dict()? Anyway, probably you can group by 'age' and get the mean of each age group and use it to fill the NaN values. – Yolao_21 Mar 02 '23 at 17:23

1 Answers1

2

You can use:

mean = df.groupby(df['Age'] // 10)['Cholesterol'].transform('mean')
df['Cholesterol2'] = df['Cholesterol'].fillna(mean)

Output:

>>> df.head(20)
    Age  Cholesterol  Cholesterol2
0    47     1.623586      1.623586
1    17     7.035887      7.035887
2    22          NaN      4.901557  # <- mean of group 2 (20-29)
3    27     5.033214      5.033214
4    33     8.508555      8.508555
5    19     1.284305      1.284305
6    57     9.915908      9.915908
7    45          NaN      5.595905  # <- mean of group 4 (40-49)
8    30     4.665067      4.665067
9    65     7.578977      7.578977
10   15     1.044828      1.044828
11   53     3.775885      3.775885
12   30     6.010004      6.010004
13   70     9.855772      9.855772
14   28     2.715962      2.715962
15   56     4.552964      4.552964
16   43     5.128147      5.128147
17   81     4.848699      4.848699
18   19     5.480054      5.480054
19   22     3.049696      3.049696

Note: I created a second column to demonstration purpose only. Of course, you can override the existing column.

If you want to have more control on groups, you can use pd.cut.

Minimal Reproducible Example:

import pandas as pd
import numpy as np

rng = np.random.default_rng(2023)
df = pd.DataFrame({'Age': rng.integers(10, 90, 100),
                   'Cholesterol': rng.uniform(1, 10, 100)})
df.loc[rng.choice(df.index, 10), 'Cholesterol'] = np.nan
Corralien
  • 109,409
  • 8
  • 28
  • 52