Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted 'best guess' values. Because missing data can complicate analysis and introduce missing-data bias, imputation is seen as a way to avoid the problems associated with listwise deletion (dropping every observation that has any missing value). Several imputation methods exist, including: imputing a single value, such as the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as kNN; stochastic averaging via multiple imputation; and model-based methods such as Expectation Maximization (EM).
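As a rough illustration of the first two method families named above (single-value imputation and a kNN-style distance heuristic), here is a minimal standard-library sketch. The function names `impute_constant` and `impute_knn`, the use of `None` as the missing marker, and the toy data are all assumptions made for this example; they are not part of any library. In practice, classes such as scikit-learn's SimpleImputer and KNNImputer, which appear in the questions below, would normally be used instead.

```python
import math
from statistics import mean, median


def impute_constant(column, strategy="mean"):
    """Single-value imputation: replace None with the mean or median
    of the observed values in the same column."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]


def impute_knn(rows, target, k=2):
    """kNN-style imputation: fill a missing rows[i][target] with the mean
    of that column over the k nearest rows that have it observed.
    Distance uses only the other features observed in both rows."""

    def distance(a, b):
        shared = [
            (x, y)
            for j, (x, y) in enumerate(zip(a, b))
            if j != target and x is not None and y is not None
        ]
        if not shared:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        row = list(row)  # copy so the input is left untouched
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: distance(row, d))[:k]
            row[target] = mean(d[target] for d in nearest)
        result.append(row)
    return result


# Toy data (hypothetical): one column with gaps, and a small two-feature table.
values = [10.5, None, 2.0, 1.0, None]
mean_filled = impute_constant(values, "mean")
median_filled = impute_constant(values, "median")

rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],  # second feature is missing
]
knn_filled = impute_knn(rows, target=1, k=2)
```

Here the gap in the last row is filled from its two closest neighbours on the first feature, while the mean and median variants fill every gap in the column with the same value. Multiple imputation and EM are not covered by this sketch.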

Suggested tag synonym: "missing-data"

931 questions
12
votes
2 answers

Pandas: How to fill null values with mean of a groupby?

I have a dataset with some missing data that looks like this: id category value 1 A NaN 2 B NaN 3 A 10.5 4 C NaN 5 A 2.0 6 B 1.0 I need to fill in the…
sfactor
  • 12,592
  • 32
  • 102
  • 152
11
votes
2 answers

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C). I also have a continuous variable with some missing values in it. I would like to replace each NA value with the mean of its group. That is, missing observations from group A have to be…
JOG
  • 113
  • 1
  • 5
11
votes
4 answers

Imputer on some Dataframe columns in Python

I am learning how to use Imputer in Python. This is my code: df=pd.DataFrame([["XXL", 8, "black", "class 1", 22], ["L", np.nan, "gray", "class 2", 20], ["XL", 10, "blue", "class 2", 19], ["M", np.nan, "orange", "class 1", 17], ["M", 11, "green",…
Mauro Gentile
  • 1,463
  • 6
  • 26
  • 37
10
votes
4 answers

R: replace NA with item from vector

I am trying to replace some missing values in my data with the average values from a similar group. My data looks like this: X Y 1 x y 2 x y 3 NA y 4 x y And I want it to look like this: X Y 1 x y 2 x y 3 y y 4 x …
gregmacfarlane
  • 2,121
  • 3
  • 24
  • 53
10
votes
1 answer

Multiple Imputation of missing and censored data in R

I have a dataset with both missing-at-random (MAR) and censored data. The variables are correlated and I am trying to impute the missing data conditionally so that I can estimate the distribution parameters for a correlated multivariate normal…
chelsea
  • 117
  • 4
9
votes
2 answers

Implementing KNN imputation on categorical variables in an sklearn pipeline

I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator that I want to use to impute categorical features in my dataset. (My question is similar to this thread but it…
LazyEval
  • 769
  • 1
  • 8
  • 22
9
votes
0 answers

Use of statsmodels.imputation.mice

I am exploring statsmodels.imputation.mice package to use for imputing missing values. I haven't seen any example of its usage, though, outside of http://www.statsmodels.org. From what I gather, one would create an instance of mice.MICEData and use…
David Makovoz
  • 1,766
  • 2
  • 16
  • 27
8
votes
1 answer

Using imputed datasets from library mice() to fit a multi-level model in R

I'm new to package mice in R. But I'm trying to impute 5 datasets from popmis and then fit an lmer() model with() each and finally pool() across them. I think the pool() function in mice() doesn't work with the lmer() call from lme4 package,…
rnorouzian
  • 7,397
  • 5
  • 27
  • 72
8
votes
4 answers

MCAR Little's test in Python

How can I execute Little's Test, to find MCAR in Python? I have looked at the R package for the same test, but I want to do it in Python. Is there an alternate approach to test MCAR?
8
votes
3 answers

Implementation of sklearn.impute.IterativeImputer

Consider data which contains some nan below: Column-1 Column-2 Column-3 Column-4 Column-5 0 NaN 15.0 63.0 8.0 40.0 1 60.0 51.0 NaN 54.0 31.0 2 15.0 17.0 55.0 80.0 NaN 3 54.0 43.0 70.0 16.0 …
k.ko3n
  • 954
  • 8
  • 26
8
votes
1 answer

Differences between sklearn's SimpleImputer and Imputer

In Python's sklearn library there are two classes that do approximately the same thing: sklearn.preprocessing.Imputer and sklearn.impute.SimpleImputer. The only difference that I found is a "constant" strategy type in SimpleImputer. Is…
MefAldemisov
  • 867
  • 10
  • 21
8
votes
1 answer

Do imputation in R when mice returns error that "system is computationally singular"

I am trying to do imputation to a medium size dataframe (~100,000 rows) where 5 columns out of 30 have NAs (a large proportion, around 60%). I tried mice with the following code: library(mice) data_3 = complete(mice(data_2)) After the first…
user8270077
  • 4,621
  • 17
  • 75
  • 140
7
votes
3 answers

Generate larger synthetic dataset based on a smaller dataset in Python

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to have a larger synthetic dataset generated based on the current dataset, say with 100000 rows, so I can use it for machine learning purposes thereby. I've…
JChat
  • 784
  • 2
  • 13
  • 33
7
votes
3 answers

Can I use Train AND Test data for Imputation?

Interestingly, I see a lot of different answers about this both on stackoverflow and other sites: While working on my training data set, I imputed missing values of a certain column using a decision tree model. So here's my question. Is it fair to…
Analysa
  • 91
  • 1
  • 8
7
votes
3 answers

Error in "missforest" in R

Need help to get around the below error while performing data imputation in R using "missforest" package. > imputed<- missForest(dummy, maxiter = 10, ntree = 100, variablewise = TRUE, + decreasing = TRUE, verbose = TRUE, + …
Sandeep
  • 81
  • 1
  • 10