Manage missing values in a dataframe with string and number columns

Question

I have a dataframe with some string columns and some numeric columns, and I want to handle the missing values by replacing each NaN with the mean of its row. I have seen other questions on this site, but they differ from mine, for example: Pandas Dataframe: Replacing NaN with row average

If all the values in a row are NaN, I want to delete that row. I have also provided a sample case as follows:

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]

df = pd.DataFrame()
df['id'] = ['a', 1, 'n']
df['md'] = ['d', 6, 'l']
df['c1'] = [2, 6, 5]
df['c2'] = [0, 5, 3]
df['c3'] = [8, 7,4]
df

Note: I have used the following code; however, it is very slow, and for a big dataframe it takes a long time to run.

# Select the value columns (everything from "c1" onward).
index_colum = df.columns.get_loc("c1")
df_withno_id = df.iloc[:, index_colum:]

# Drop the rows whose value columns are all NaN (drop by index label).
rowsidx_with_all_NaN = df_withno_id[df_withno_id.isnull().all(axis=1)].index.values
df = df.drop(rowsidx_with_all_NaN)

# Fill the remaining NaN values row by row (slow on large dataframes).
for i, cols in df_withno_id.iterrows():
    if i not in rowsidx_with_all_NaN:
        mean = np.nanmean(list(cols))
        df.loc[i] = df.loc[i].replace(np.nan, mean)

Can anybody help me with this? Thanks.

score 0 · Accepted Answer · answered Feb 10 '22 at 00:31

First, select only the float-typed columns. Second, drop the rows in which all of those columns are NaN. Finally, transpose the float columns, fill each NaN with the column mean of the transposed frame (which is the original row mean), and transpose back.

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['id'] = ['a', 'b', 'c', 'n']
df['md'] = ['d', 'e', 'f', 'l']
df['c1'] = [2, np.nan,np.nan, 5]
df['c2'] = [0, 5, np.nan, 3]
df['c3'] = [8, 7, np.nan,np.nan]

numeric_cols = df.select_dtypes(include='float64').columns
df.dropna(how = 'all', subset = numeric_cols, inplace = True)
df[numeric_cols] = df[numeric_cols].T.fillna(df[numeric_cols].T.mean()).T
df
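
The same fill can also be written without the two transposes. This is only a minimal sketch, assuming the value columns are exactly the numeric ones; the names `result`, `values` and `row_means` are illustrative. `DataFrame.where` keeps the cells where the condition holds and, with `axis=0`, takes the replacement from a Series aligned on the row index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "n"],
    "md": ["d", "e", "f", "l"],
    "c1": [2, np.nan, np.nan, 5],
    "c2": [0, 5, np.nan, 3],
    "c3": [8, 7, np.nan, np.nan],
})

# Assumption: every numeric column is a value column.
numeric_cols = df.select_dtypes(include="number").columns

# Drop rows whose numeric columns are all NaN, working on a copy.
result = df.dropna(how="all", subset=numeric_cols).copy()

values = result[numeric_cols]
row_means = values.mean(axis=1)  # NaN is skipped by default

# Keep existing values; fill the gaps with the mean of the same row.
result[numeric_cols] = values.where(values.notna(), row_means, axis=0)
print(result)
```

On the sample data this keeps rows a, b and n, and fills c1 of row b with 6 and c3 of row n with 4.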

gabriel11


Thanks @gabriel. Yes, this solution works for this example. However, in my real dataframe there are some columns whose id is also a number and with this. – Feb 10 '22 at 00:54
I don't understand, can you give a sample of that type of column? – gabriel11 Feb 10 '22 at 12:30
I've updated the question. However, I have solved the problem in another way. Thanks for reaching out. – Feb 10 '22 at 17:29
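
If the id columns can themselves be numeric, selecting by dtype would treat them as values too. One possible way around this, shown only as a sketch, is to pick the value columns by position, as the question's own code does with `df.columns.get_loc("c1")`. The helper name `fill_row_means` and the numeric ids below are hypothetical, and this is not necessarily how the asker ended up solving it.

```python
import numpy as np
import pandas as pd


def fill_row_means(frame, first_value_col):
    """Fill NaN with the row mean over the columns from first_value_col onward."""
    start = frame.columns.get_loc(first_value_col)
    value_cols = frame.columns[start:]

    # Drop rows in which every value column is missing.
    out = frame.dropna(how="all", subset=value_cols).copy()

    vals = out[value_cols]
    # After transposing, each original row is a column, so fillna with a
    # Series of row means fills each row with its own mean.
    out[value_cols] = vals.T.fillna(vals.mean(axis=1)).T
    return out


df = pd.DataFrame({
    "id": [10, 11, 12, 13],  # hypothetical numeric ids
    "md": ["d", "e", "f", "l"],
    "c1": [2, np.nan, np.nan, 5],
    "c2": [0, 5, np.nan, 3],
    "c3": [8, 7, np.nan, np.nan],
})

filled = fill_row_means(df, "c1")
print(filled)
```

The id column is left untouched because it sits before "c1" and is never included in the mean.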
