
I'm working on a machine learning problem in which there are many missing values in the features. There are hundreds of features, and I would like to remove the features that have too many missing values (say, features with more than 80% missing values). How can I do that in Python?

My data is a Pandas DataFrame.

– HHH

9 Answers


Demo:

Setup:

In [104]: import numpy as np; import pandas as pd

In [105]: df = pd.DataFrame(np.random.choice([2, np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN
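For reference, the same column filter can be applied in a single step with .loc (a small variant added here, not part of the original answer); it returns the same frame as Out[109]:

In [110]: df.loc[:, df.isnull().mean() < 0.8]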
– MaxU - stand with Ukraine
Great solution as always, +1. However, for visibility I'd say it is better to have more columns rather than rows. I added a row filter as an answer too (or maybe that's just me, sitting on a laptop at the moment). – Anton vBR Aug 04 '17 at 21:00

You can use Pandas' dropna(). Note that thresh is the minimum number of non-NaN values a column must contain in order to be kept, so to drop columns with more than 80% missing values the threshold must be 20% of the row count:

min_count = int(len(yourdf) * 0.20)  # keep a column only if at least 20% of its values are non-NaN
yourdf = yourdf.dropna(thresh=min_count, axis=1)
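A quick sanity check on the demo frame from the answer above (an illustration added here, not part of the original answer): with 20 rows, thresh is 4 non-NaN values, so column c (only 3 non-NaN values, 85% missing) is dropped, while a, at exactly 80% missing, is kept.

min_count = int(len(df) * 0.20)       # 20 rows -> at least 4 non-NaN values required
df.dropna(thresh=min_count, axis=1)   # keeps a, b, d, e; drops c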
– singmotor

Following MaxU's example, here is the equivalent for filtering rows:

    df = pd.DataFrame(np.random.choice([2, np.nan], (5, 10), p=[0.2, 0.8]), columns=list('abcdefghij'))

        a    b    c    d    e    f    g    h    i    j
    0   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    3   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Rows

    df.loc[df.isnull().mean(axis=1).lt(0.8)]

        a    b    c    d    e    f    g    h    i    j
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN
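The same row filter can also be written with dropna() (my addition, not part of the original answer). Because thresh counts non-NaN values, a row with exactly 80% missing is kept here, unlike with the strict lt(0.8) mask:

    df.dropna(axis=0, thresh=int(df.shape[1] * 0.2))  # keep rows with at least 20% non-NaN values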
– Anton vBR

To generalize within Pandas, you can do the following to calculate the percentage of missing values in each column. From that you can identify the features with more than 80% missing values and drop those columns from the DataFrame.

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)
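The same idea condenses into a single expression (a stylistic variant added here, not part of the original answer):

df = df.drop(columns=df.columns[df.isnull().mean() > 0.80])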
– vielkind

Here is a simple function which you can use directly by passing a DataFrame and a threshold:

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns whose missing percentage is below the threshold
    l = dff.columns[pct_missing < threshold].tolist()
    print("# Columns having at least %s percent missing values:" % threshold, dff.shape[1] - len(l))
    print("Columns:\n", list(set(dff.columns.values) - set(l)))
    return l


rmissingvaluecol(df, 80)  # Here the threshold is 80%, so columns with at least 80% missing values are dropped

# Output
'''
# Columns having at least 80 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df, 80)
df1 = df[l]

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))

missing(df)

# Output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''
– Suhas_Pote

A concise way to get the count of NaNs, or the fraction of missing values, per column (for example, see the snippet after this list):

  • for the count: df.isna().sum()
  • for the fraction: df.isna().mean()
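For example, to turn the fraction into a percentage, or to use it directly as a column filter (a small illustration of these two methods, not part of the original answer):

df.isna().mean().mul(100)            # percent of missing values per column
df.loc[:, df.isna().mean() <= 0.8]   # keep columns with at most 80% missing values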
– BP34500
A helper that reports the columns whose missing-value count or fraction reaches a threshold:

def show_null_columns(data, agg, threshold):
    # Aggregate the missing values per column as a count ('sum') or a fraction ('mean')
    if agg == 'sum':
        null_cols = data.isnull().sum()
    elif agg == 'mean':
        null_cols = data.isnull().mean()
    else:
        raise ValueError("agg must be 'sum' or 'mean'")
    # Map each column at or above the threshold to its missing-value statistic
    null_dic = {}
    for col, x in zip(data.columns, null_cols):
        if x >= threshold:
            null_dic[col] = x
    return null_dic

null_dic = show_null_columns(train, 'mean', 0.8)
train2 = train.drop(columns=list(null_dic))
– mannem srinivas

Use:

df = df[df.isnull().sum(axis=1) <= 5]

This drops the rows that have more than five missing values.


One thing to note about dropna(), per the documentation: the thresh argument specifies the minimum number of non-NaN values required to keep a row or column.
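A minimal illustration of that behaviour (my own toy example, not from the documentation):

import pandas as pd

df = pd.DataFrame({'a': [1, None, None, None], 'b': [1, 2, 3, 4]})
df.dropna(axis=1, thresh=2)  # 'a' has only one non-NaN value (< 2), so it is dropped; 'b' is kept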

– ricecooker