5

Is it possible to impute values for a specific column?

For example, if I have 3 columns:

  • A (categorical): does not contain any missing values
  • B (numeric): does not contain any missing values
  • C (numeric): contains some missing values. I want to impute only this column.
TylerH
Glorian
  • Are ***A,B*** integer, numeric, categorical? – smci Apr 12 '18 at 21:14
  • @smci: let's say A is categorical and B is numeric – Glorian Apr 12 '18 at 21:15
  • Are you trying to impute from A,B,C (multiple imputation), or only from C (single imputation)? I think you're trying to do the former. – smci Apr 12 '18 at 22:30
  • First you need to convert categorical data to numerical by encoding. Then you can use the regression model to predict the missing values. – Vivek Kumar Apr 16 '18 at 12:13
  • @smci: sorry for my late reply. I only want to impute in column C (single imputation) – Glorian Jun 12 '18 at 18:38
  • @VivekKumar: is there any simpler solution which does not involve applying a learning model to predict missing values? IMHO, the solution that you propose, is a bit complicated for the pre-processing step. – Glorian Jun 12 '18 at 18:39
  • As you clarified now that you want to do single imputation (independent from other columns), you can use [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) which can choose from multiple techniques `(mean, mode, median)` to fill the missing values. – Vivek Kumar Jun 13 '18 at 01:07
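For the single-column case settled on in the comments above, a pandas-only version needs no learning model at all. This is a minimal sketch: the sample frame and its values are made up, and the median is just one of the strategies mentioned (mean or mode work the same way).

```python
import numpy as np
import pandas as pd

# Hypothetical data laid out like the question: A categorical, B numeric, C with gaps.
df = pd.DataFrame({
    "A": ["x", "y", "x", "y"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, np.nan, 30.0, 100.0],
})

# Fill the gaps in C with the median of its observed values.
# A and B are never touched.
df["C"] = df["C"].fillna(df["C"].median())
```

Swapping `median()` for `mean()` changes the strategy without changing anything else.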

3 Answers

12

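You can impute just that column and flatten the 2-D result with numpy.ravel:

from sklearn.preprocessing import Imputer  # deprecated; see the comments below
# missing_values=0 treats zeros as the missing marker; use "NaN" if the gaps are NaN
imp = Imputer(missing_values=0, strategy="mean", axis=0)
df["C"] = imp.fit_transform(df[["C"]]).ravel()  # fit_transform returns shape (n, 1)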
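On scikit-learn releases where Imputer is no longer available, the same single-column idea can be sketched with SimpleImputer from sklearn.impute. The sample frame is made up for illustration, and NaN is assumed to be the missing marker:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: only C has gaps.
df = pd.DataFrame({
    "A": ["x", "y", "x"],
    "B": [1.0, 2.0, 3.0],
    "C": [1.0, np.nan, 3.0],
})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# Double brackets pass a one-column DataFrame (2-D), which the imputer requires.
# ravel() flattens the (n, 1) result back to one dimension for assignment.
df["C"] = imp.fit_transform(df[["C"]]).ravel()
```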
smci
mcard
  • 1
    Imputer only does single imputation (e.g. of column C to column C). Sounds like OP wants to do multiple imputation from A,B,C – smci Apr 12 '18 at 22:48
  • 2
    Also, for future viewers: use SimpleImputer, as the Imputer class has been deprecated. See https://github.com/scikit-learn/scikit-learn/blob/8d7e849428a4edd16c3e2a7dc8a088f108986a17/sklearn/preprocessing/imputation.py#L64 – Ali H. Kudeir Apr 10 '21 at 00:23
4

If you have a dataframe with missing data in multiple columns, and you want to impute one specific column based on the others, you can impute everything and then keep only the column you want:

from sklearn.impute import KNNImputer
import pandas as pd

imputer = KNNImputer()
imputed_data = imputer.fit_transform(df)  # impute all the missing data
df_temp = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)  # keep labels so rows align
df['COL_TO_IMPUTE'] = df_temp['COL_TO_IMPUTE']  # update only the desired column

Another method is to replace the missing data in the desired column with a sentinel value that occurs nowhere else in the data (for example max + 1 for numeric data), and then tell the imputer that the sentinel marks missing data. KNNImputer works on numeric input, so a string marker such as # will not work with it:

from sklearn.impute import KNNImputer
import pandas as pd

cols_backup = df.columns
sentinel = df.max().max() + 1  # numeric value not present in any column
df['COL_TO_IMPUTE'] = df['COL_TO_IMPUTE'].fillna(sentinel)  # mark missing data in the desired column
# only the sentinel counts as missing; this assumes the other columns contain no NaN,
# because scikit-learn's input validation rejects NaN when missing_values is not NaN
imputer = KNNImputer(missing_values=sentinel)
imputed_data = imputer.fit_transform(df)  # impute all sentinel entries
df = pd.DataFrame(data=imputed_data, columns=cols_backup, index=df.index)
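With the question's layout, column A is categorical, and KNNImputer cannot take strings directly. One way around that, sketched here under assumptions, is to one-hot encode A first and copy back only C afterwards. The data, the choice of `pd.get_dummies`, and `n_neighbors=2` are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: A categorical, B numeric, C numeric with gaps.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "x"],
    "B": [1.0, 1.1, 5.0, 5.2, 0.9],
    "C": [10.0, np.nan, 50.0, np.nan, 12.0],
})

# One-hot encode A so that every feature is numeric.
features = pd.get_dummies(df, columns=["A"], dtype=float)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# Update only the desired column; A and B keep their original values and dtypes.
df["C"] = imputed["C"]
```

Distances are computed on the raw feature values, so columns on very different scales may need scaling first.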
Alaa M.
-1

As you said, the other columns have no missing data, so any imputation method (mean, KNN, or another) will only change the missing values in column C. Just pass the full dataset to the imputer and you get back a complete dataset with no missing values.

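import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(with_missing)  # learn the per-column means
imputed_data = imr.transform(with_missing)  # fill the NaNs with those means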

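Or with the KNN imputer:

from sklearn.impute import KNNImputer
# missing_values defaults to np.nan; scikit-learn names the NaN-aware metric "nan_euclidean"
imputer_KNN = KNNImputer(n_neighbors=3, weights="uniform", metric="nan_euclidean")
imputed_data = imputer_KNN.fit_transform(with_missing)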
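The claim that complete columns come back unchanged can be checked on a small example. This is a sketch on made-up data: because mean imputation only works on numbers, it selects the numeric columns first, and the `select_dtypes` step is an addition of this example rather than part of the snippets above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: only C has a gap.
with_missing = pd.DataFrame({
    "A": ["x", "y", "x"],
    "B": [1.0, 2.0, 3.0],
    "C": [4.0, np.nan, 8.0],
})

# Mean imputation cannot handle the string column A, so restrict to numeric columns.
numeric_cols = with_missing.select_dtypes(include="number").columns

imr = SimpleImputer(strategy="mean")
imputed = with_missing.copy()
imputed[numeric_cols] = imr.fit_transform(with_missing[numeric_cols])

# B had no gaps, so it is identical before and after; only C changed.
```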
  • Your current answer reads more like a comment. If you want to keep it as an answer, you should elaborate and maybe include an example. – Arar Apr 01 '20 at 15:33