
I'm working on a machine learning problem in which there are many missing values in the features. There are hundreds of features, and I would like to remove the features that have too many missing values (say, features with more than 80% missing values). How can I do that in Python?

My data is a Pandas DataFrame.

– HHH

9 Answers


Demo:

Setup:

In [104]: import numpy as np; import pandas as pd

In [105]: df = pd.DataFrame(np.random.choice([2, np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN
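For reference, the same column filter can be applied in a single step with .loc (a small variant added here, not part of the original answer); it returns the same frame as Out[109]:

In [110]: df.loc[:, df.isnull().mean() < 0.8]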
– MaxU - stand with Ukraine
Great solution as always, +1. However, for visibility I'd say it is better to have more columns rather than rows. I added a row filter as an answer too (or maybe that's just me, sitting on a laptop at the moment). – Anton vBR Aug 04 '17 at 21:00

You can use Pandas' dropna(). Note that thresh is the minimum number of non-NaN values a column must contain in order to be kept, so to drop columns with more than 80% missing values the threshold must be 20% of the row count:

min_count = int(len(yourdf) * 0.20)  # keep a column only if at least 20% of its values are non-NaN
yourdf = yourdf.dropna(thresh=min_count, axis=1)
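A quick sanity check on the demo frame from the answer above (an illustration added here, not part of the original answer): with 20 rows, thresh is 4 non-NaN values, so column c (only 3 non-NaN values, 85% missing) is dropped, while a, at exactly 80% missing, is kept.

min_count = int(len(df) * 0.20)       # 20 rows -> at least 4 non-NaN values required
df.dropna(thresh=min_count, axis=1)   # keeps a, b, d, e; drops c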
– singmotor

Following MaxU's example, here is the equivalent for filtering rows:

    df = pd.DataFrame(np.random.choice([2, np.nan], (5, 10), p=[0.2, 0.8]), columns=list('abcdefghij'))

        a    b    c    d    e    f    g    h    i    j
    0   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    3   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Rows

    df.loc[df.isnull().mean(axis=1).lt(0.8)]

        a    b    c    d    e    f    g    h    i    j
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN
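The same row filter can also be written with dropna() (my addition, not part of the original answer). Because thresh counts non-NaN values, a row with exactly 80% missing is kept here, unlike with the strict lt(0.8) mask:

    df.dropna(axis=0, thresh=int(df.shape[1] * 0.2))  # keep rows with at least 20% non-NaN values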
– Anton vBR

To generalize within Pandas, you can do the following to calculate the percentage of missing values in each column. From that you can identify the features with more than 80% missing values and drop those columns from the DataFrame.

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)
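The same idea condenses into a single expression (a stylistic variant added here, not part of the original answer):

df = df.drop(columns=df.columns[df.isnull().mean() > 0.80])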
– vielkind

Here is a simple function which you can use directly by passing a DataFrame and a threshold:

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns whose missing percentage is below the threshold
    l = dff.columns[pct_missing < threshold].tolist()
    print("# Columns having at least %s percent missing values:" % threshold, dff.shape[1] - len(l))
    print("Columns:\n", list(set(dff.columns.values) - set(l)))
    return l


rmissingvaluecol(df, 80)  # Here the threshold is 80%, so columns with at least 80% missing values are dropped

# Output
'''
# Columns having at least 80 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df, 80)
df1 = df[l]

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))

missing(df)

# Output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''
– Suhas_Pote

A concise way to get the count of NaNs, or the fraction of missing values, per column (for example, see the snippet after this list):

  • for the count: df.isna().sum()
  • for the fraction: df.isna().mean()
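For example, to turn the fraction into a percentage, or to use it directly as a column filter (a small illustration of these two methods, not part of the original answer):

df.isna().mean().mul(100)            # percent of missing values per column
df.loc[:, df.isna().mean() <= 0.8]   # keep columns with at most 80% missing values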
– BP34500
A helper that reports the columns whose missing-value count or fraction reaches a threshold:

def show_null_columns(data, agg, threshold):
    # Aggregate the missing values per column as a count ('sum') or a fraction ('mean')
    if agg == 'sum':
        null_cols = data.isnull().sum()
    elif agg == 'mean':
        null_cols = data.isnull().mean()
    else:
        raise ValueError("agg must be 'sum' or 'mean'")
    # Map each column at or above the threshold to its missing-value statistic
    null_dic = {}
    for col, x in zip(data.columns, null_cols):
        if x >= threshold:
            null_dic[col] = x
    return null_dic

null_dic = show_null_columns(train, 'mean', 0.8)
train2 = train.drop(columns=list(null_dic))
– mannem srinivas

Use:

df = df[df.isnull().sum(axis=1) <= 5]

This drops the rows that have more than five missing values.


One thing to note about dropna(), per the documentation: the thresh argument specifies the minimum number of non-NaN values required to keep a row or column.
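A minimal illustration of that behaviour (my own toy example, not from the documentation):

import pandas as pd

df = pd.DataFrame({'a': [1, None, None, None], 'b': [1, 2, 3, 4]})
df.dropna(axis=1, thresh=2)  # 'a' has only one non-NaN value (< 2), so it is dropped; 'b' is kept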

– ricecooker