Scikit-learn - Impute values in a specific column

Question

Is it possible to impute values for a specific column?

For example, if I have 3 columns:

A (categorical): does not contain any missing values
B (numeric): does not contain any missing values
C (numeric): contains some missing values. I want to impute only this column.

Are you trying to impute from A,B,C (multiple imputation), or only from C (single imputation)? I think you're trying to do the former. — smci, Apr 12 '18 at 22:30
First you need to convert categorical data to numerical by encoding. Then you can use the regression model to predict the missing values. — Vivek Kumar, Apr 16 '18 at 12:13
@smci: sorry for my late reply. I only want to impute in column C (single imputation) — Glorian, Jun 12 '18 at 18:38
@VivekKumar: is there any simpler solution which does not involve applying a learning model to predict missing values? IMHO, the solution that you propose, is a bit complicated for the pre-processing step. — Glorian, Jun 12 '18 at 18:39
As you clarified now that you want to do single imputation (independent from other columns), you can use [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) which can choose from multiple techniques `(mean, mode, median)` to fill the missing values. — Vivek Kumar, Jun 13 '18 at 01:07

score 12 · Answer 1 · edited Apr 12 '18 at 22:44

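import numpy as np
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22

imp = SimpleImputer(missing_values=np.nan, strategy="mean")  # treat NaN as missing; change this if your data marks missing values differently
df["C"] = imp.fit_transform(df[["C"]]).ravel()  # fit on column C only and write the result back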
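For future readers, here is a minimal self-contained sketch of single-column imputation on a toy DataFrame. The data is made up for illustration, and the sketch assumes that missing entries are NaN and that the installed scikit-learn provides `sklearn.impute.SimpleImputer` (version 0.20 or later):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): only column C has missing values.
df = pd.DataFrame({
    "A": ["x", "y", "x", "z"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, np.nan, 30.0, np.nan],
})

# strategy may also be "median" or "most_frequent".
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# The imputer expects 2-D input, so pass a one-column DataFrame
# and flatten the result back to 1-D before assigning it.
df["C"] = imp.fit_transform(df[["C"]]).ravel()
```

Columns A and B are never passed to the imputer, so they are left exactly as they were.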

edited Apr 12 '18 at 22:44

smci

answered Apr 12 '18 at 21:24

mcard

Imputer only does single imputation (e.g. of column C to column C). Sounds like OP wants to do multiple imputation from A,B,C – smci Apr 12 '18 at 22:48
Also, for future viewers: use SimpleImputer, as the Imputer class has been deprecated. See https://github.com/scikit-learn/scikit-learn/blob/8d7e849428a4edd16c3e2a7dc8a088f108986a17/sklearn/preprocessing/imputation.py#L64 – Ali H. Kudeir Apr 10 '21 at 00:23

Alaa M. · Answer 2 · 2021-02-06T17:34:24.153

If you have a dataframe with missing data in multiple columns, and you want to impute a specific column based on the others, you can impute everything and take that specific column that you want:

from sklearn.impute import KNNImputer
import pandas as pd

imputer = KNNImputer()
imputed_data = imputer.fit_transform(df)  # impute all the missing data (every column must be numeric)
df_temp = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)  # keep df's index so the assignment below aligns
df['COL_TO_IMPUTE'] = df_temp['COL_TO_IMPUTE']  # update only the desired column
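A small worked sketch of this approach on made-up data (the column names and values are hypothetical). It assumes every column is numeric, since `KNNImputer` does not accept strings; a categorical column would have to be encoded first:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: both B and C contain NaN, but only C should be filled.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, np.nan, 6.0, 8.0, 10.0],
    "C": [1.5, 2.5, np.nan, 4.5, 5.5],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df),  # fills NaN in every column
    columns=df.columns,
    index=df.index,             # keep the original index so the assignment aligns
)

# Copy back only the column of interest; B keeps its NaN.
df["C"] = imputed["C"]
```

The neighbours used to fill C are found from all columns, which is what distinguishes this from imputing C on its own.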

Another method is to replace the missing data in the desired column with a sentinel value that does not occur anywhere else in the data, and then tell the imputer that this sentinel marks missing data. KNNImputer only accepts numeric input, so a string such as # will not work; use a number such as max + 1 instead:

from sklearn.impute import KNNImputer
import pandas as pd

sentinel = df['COL_TO_IMPUTE'].max() + 1  # numeric value that does not occur in the column; check it is absent from the other columns too
df['COL_TO_IMPUTE'] = df['COL_TO_IMPUTE'].fillna(sentinel)  # replace all missing data in the desired column with the sentinel
imputer = KNNImputer(missing_values=sentinel)  # tell the imputer to consider only the sentinel as missing data
imputed_data = imputer.fit_transform(df)  # impute the sentinel values; raises ValueError if other columns still contain NaN
df = pd.DataFrame(data=imputed_data, columns=df.columns, index=df.index)

Mohammed Shantal · Answer 3 · 2020-09-15T03:37:39.677

-1

As you said, some of the columns have no missing data. That means that any imputation method, such as mean or KNN, will only impute the missing values in column C. You just have to pass your data with missing values to an imputation method, and you will get back complete data with nothing missing.

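import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')  # np.NaN was removed in NumPy 2.0
imr = imr.fit(with_missing)  # with_missing: numeric data containing NaN
imputed_data = imr.transform(with_missing)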

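Or with the KNN imputer:

import numpy as np
from sklearn.impute import KNNImputer

# Released versions of KNNImputer use np.nan and metric="nan_euclidean"
imputer_KNN = KNNImputer(missing_values=np.nan, n_neighbors=3, weights="uniform", metric="nan_euclidean")
imputed_data = imputer_KNN.fit_transform(with_missing)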
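If the data also contains a categorical column such as A, passing the whole frame to a mean imputer would fail on the strings. One possible alternative, not taken from the answer above, is to restrict the imputer to column C with `ColumnTransformer` and pass the other columns through unchanged. The data below is made up, and the sketch assumes missing values are NaN:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data matching the question: A categorical, B numeric, C with gaps.
with_missing = pd.DataFrame({
    "A": ["x", "y", "x", "z"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, np.nan, 30.0, np.nan],
})

ct = ColumnTransformer(
    transformers=[("impute_c", SimpleImputer(strategy="mean"), ["C"])],
    remainder="passthrough",  # leave A and B untouched
)

out = ct.fit_transform(with_missing)

# Transformed columns come first, followed by the passthrough columns
# in their original order, so label them accordingly and then reorder.
imputed_data = pd.DataFrame(out, columns=["C", "A", "B"], index=with_missing.index)
imputed_data = imputed_data[with_missing.columns]
```

Because A holds strings, the combined array has object dtype, so numeric columns may need to be cast back with `astype` afterwards.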

edited Sep 15 '20 at 03:37

answered Apr 01 '20 at 13:26

Mohammed Shantal


Your current answer could be posted as a comment. If you want to keep it as an answer, you should elaborate more and maybe include an example. – Arar Apr 01 '20 at 15:33
