Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted, 'best guess' values. Because missing data can complicate analysis and introduce missing-data bias, imputation is seen as a way to avoid the problems associated with listwise deletion (dropping every observation that has any missing value). Several imputation methods exist, including: filling missing values with a single value, such as the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as kNN; stochastic averaging via multiple imputation; and model-based methods such as Expectation Maximization (EM).
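As a rough illustration of the simplest method listed above (single-value imputation with the mean or median), here is a minimal standard-library sketch. The function name `impute_column`, its `strategy` and `fill_value` parameters, and the sample `ages` data are all hypothetical and chosen for illustration; it does not cover kNN, multiple imputation, or EM.

```python
import math
from statistics import mean, median


def impute_column(values, strategy="mean", fill_value=0.0):
    """Replace NaN entries in one column with a single summary value.

    `strategy` and `fill_value` are illustrative parameters (assumptions),
    not part of any particular library's API.
    """
    # Only the observed (non-missing) values contribute to the summary.
    observed = [v for v in values if not math.isnan(v)]

    if not observed:
        # A column with no observed values has no mean or median,
        # so fall back to a caller-supplied default.
        replacement = fill_value
    elif strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return [replacement if math.isnan(v) else v for v in values]


# Hypothetical sample column with two missing entries.
ages = [22.0, math.nan, 50.0, 80.0, math.nan]

mean_imputed = impute_column(ages, strategy="mean")
median_imputed = impute_column(ages, strategy="median")
all_missing = impute_column([math.nan, math.nan], fill_value=0.0)
```

Filling every gap with the same value keeps the column mean (for the mean strategy) but shrinks its variance, which is one reason the stochastic and model-based methods are often preferred.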

Suggested tag synonym: "missing-data"

931 questions
4
votes
0 answers

Scikit-learn mean imputation gives different mean after imputation

I am doing mean imputation for missing values in a large numpy array.
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
...
>>> X_train_reshaped.shape
(6794600, 19)
>>> imp = Imputer()
>>> X_train_reshaped_imputed =…
arun
4
votes
1 answer

How to impute NaN values to a default value if strategy fails?

Problem: I am using the sklearn.preprocessing.Imputer class to impute NaN values using a mean strategy over the columns, i.e. axis=0. My problem is that some data which needs to be imputed only has NaN values in its column, e.g. when there is only a…
Thijs van Ede
4
votes
2 answers

Stripplot in MICE does not show categorical variables

I'm using the mice package in R to do multiple imputation. I've done several imputations with only numerical variables, the imputation method is predictive mean matching, and when I use stripplot(imp) I get to see the observed and imputed values of…
4
votes
3 answers

Simulate data and randomly add missing values to dataframe

How can I randomly add missing values to some or each column (say random ~5% missing in each) in a simulated dataframe, plus, is there a more efficient way of simulating a dataframe with both continuous and factor columns? #Simulate some data N <-…
aelhak
4
votes
1 answer

Why does fillna with median on dataframe still leave Na/NaN in pandas?

I've seen this and this thread here, but something else is wrong. I have a very large pandas DataFrame, with many Na/NaN values. I want to replace them with the median value for that feature. So, I first make a table that displays the Na values per…
GrundleMoof
4
votes
2 answers

NA in time series handling?

I am dealing with a forecast of time series in R. I have several questions: I would like to ask how we can handle missing values in time series? I guess we can somehow interpolate them? Can you suggest some solution in R for this?
syeenn
4
votes
2 answers

Impute missing values to 0, and create indicator columns in Pandas

I have a very simple dataframe in Pandas, testdf = [{'name' : 'id1', 'W': np.NaN, 'L': 0, 'D':0}, {'name' : 'id2', 'W': 0, 'L': np.NaN, 'D':0}, {'name' : 'id3', 'W': np.NaN, 'L': 10, 'D':0}, {'name' : 'id4', 'W':…
Monica Heddneck
4
votes
4 answers

Sklearn: Categorical Imputer?

Is there a way to impute categorical values using a sklearn.preprocessing object? I would like to ultimately create a preprocessing object which I can apply to new data and have it transformed the same way as old data. I am looking for a way to do…
4
votes
2 answers

How to fill missing values using median imputation in R for all the columns based on a customer id for panel data?

Customer id  Year  a   b
1            2000  10  2
1            2001  5   3
1            2002  NA  4
1            2003  NA  5
2            2000  2   NA
2            2001  NA  4
2            …
4
votes
2 answers

svd imputation R

I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column). This is the dataset with missing data http://pastebin.com/YS9qaUPs #load data dataMiss = read.csv('dataMiss.csv') #impute…
Sojers
4
votes
4 answers

Replacing NA's in each column of matrix with the median of that column

I am trying to replace the NA's in each column of a matrix with the median of that column. However, when I try to use lapply or sapply I get an error; the code works when I use a for-loop and when I change one column at a time. What am I doing…
Jonno Bourne
4
votes
1 answer

How to find RMSE by using loop in R

If I have a data frame containing 3 variables: origdata <- data.frame( age <- c(22, 45, 50, 80, 55, 45, 60, 24, 18, 15), bmi <- c(22, 24, 26, 27, 28, 30, 27, 25.5, 18, 25), hyp <- c(1, 2, 4, 3, 1, 2, 1, 5, 4, 5) ) I created MCAR…
zhyan
3
votes
1 answer

How to impute NA values or create all possible combinations?

data.frame(
  group = c("a", "b", "c", "d", "e", "total"),
  count = c(NA, NA, 10, 21, 49, 85)
)
>   group count
1     a    NA
2     b    NA
3     c    10
4     d    21
5     e    49
6 total    85
Given the above data frame, how can I impute the…
electronix384128
3
votes
1 answer

Error message with missForest package (imputation using Random Forest)

My dataframe is below. All variables are numeric, one of them (Total) has about 20 NAs. I would like the missForest package to create imputed values for the NAs in Total. I am running R version 4.2.1 (2022-06-23 ucrt) on Windows. imp <-…
lawyeR
3
votes
2 answers

AUC of logistic and ordinal model following multiple imputation using MICE (with R)

I am asking a question concerning the additive predictive benefit of the inclusion of a variable to a logistic and an ordinal model. I am using mice to impute missing covariates and am having difficulty finding ways to calculate the AUC and R…
DW1310