Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted 'best guess' values. Because missing data can complicate analysis and lead to missing-data bias, imputation is seen as a way to avoid the problems associated with listwise deletion (dropping every observation that has any missing value). Several imputation methods exist, including: imputing a single value, such as the mean, the median, or a specific value chosen from domain expertise; distance-based heuristics such as kNN; stochastic averaging via multiple imputation; and model-based methods, including Expectation Maximization (EM).
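
As a rough illustration of the simplest method mentioned above (single-value imputation with the column mean), here is a minimal sketch in plain Python. The function name `impute_column_means`, the use of `None` as the missing-value marker, and the sample data are assumptions made for this example; they are not taken from any particular library.

```python
from statistics import mean


def impute_column_means(rows):
    """Return a copy of rows with None entries replaced by the column mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Only observed (non-missing) values contribute to the mean.
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values has no mean; leave it as None.
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: None marks a missing value.
data = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [None, 6.0, 9.0],
]
imputed = impute_column_means(data)
print(imputed)
```

Mean imputation is quick, but it shrinks the variance of the imputed column, which is one reason the other methods listed above (kNN, multiple imputation, EM) are often preferred for inference.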

Suggested tag synonym: "missing-data"

931 questions
5
votes
2 answers

Fastest way to impute column means with large data

I have a large numeric dataset (~700 rows, 350,000 columns, reading in as a data.table in R) containing some NA's that I would like to replace with column means as quickly as possible. I found a previous post that replaces NA's with 0, but when I…
bfd
  • 65
  • 5
5
votes
1 answer

Exclude missing values from model performance calculation

I have a dataset and I want to build a model, preferably with the caret package. My data is actually a time series but the question is not specific to time series, it's just that I work with CreateTimeSlices for the data partition. My data has a…
agenis
  • 8,069
  • 5
  • 53
  • 102
5
votes
2 answers

imputing missing values using a predictive model

I am trying to impute missing values in Python and sklearn does not appear to have a method beyond average (mean, median, or mode) imputation. Orange imputation model seems to provide a viable option. However, it appears Orange.data.Table is not…
sedeh
  • 7,083
  • 6
  • 48
  • 65
5
votes
3 answers

Imputation in R

I am new to the R programming language. I just wanted to know whether there is any way to impute the null values of just one column in a dataset, because all of the imputation commands and libraries that I have seen impute the null values of the whole dataset.
Mehrdad Rohani
  • 181
  • 1
  • 1
  • 12
4
votes
1 answer

sklearn imputer drop column with missing values

I am currently learning about sklearn's imputers and found that there is one strategy that they do not implement. I would like to build a pipeline that deletes the columns with any missing values or deletes all the rows with missing…
Espoir Murhabazi
  • 5,973
  • 5
  • 42
  • 73
4
votes
0 answers

Using mice imputed data sets in GLM analysis; can pooled model fit indices be obtained?

I used mice to impute five missing data sets, saved as the object "allImputations" in the code below. I then needed to complete linear and dichotomous regression analyses across the imputed data sets (see below for a successful…
Clar_k
  • 41
  • 2
4
votes
2 answers

Understanding sklearn's KNNImputer

I was going through its documentation and it says: Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither are missing are…
4
votes
1 answer

Hot Deck Imputation in Python

I have been trying to find Python code that would allow me to replace missing values in a dataframe's column. The focus of my analysis is in biostatistics so I am not comfortable with replacing values using means/medians/modes. I would like to apply…
Zakariah Siyaji
  • 989
  • 8
  • 27
4
votes
1 answer

Imputation methods in mice - correlation in data set. R

I'm struggling with an imputation using mice. The main objective is to impute NAs (if possible by group). As the sample is a bit too large to simply post here, it is downloadable: https://drive.google.com/open?id=1InGJ_M7r5jwQZZRdXBO1MEbKB48gafbP My…
Juan
  • 171
  • 1
  • 12
4
votes
1 answer

How does the Multivariate imputer in scikit-learn differ from the Simple imputer?

I have a matrix of data with missing values that I am trying to impute, and I am looking at the options for different imputers and checking to see what settings would work best for the biological context I am working in. I understand the knnimpute…
Kangaroo
  • 41
  • 1
  • 2
4
votes
2 answers

How to deal with NaN values where imputation doesn't make sense? (for PCA)

I am having a hard time figuring out how to deal with NaN variables where data imputation doesn't make sense. I am trying to do text/document clustering and there are some missing values that need to stay missing because there is no sensible way…
MehmedB
  • 1,059
  • 1
  • 16
  • 42
4
votes
2 answers

Multiple imputation in R (mice) - How do I test imputation runs?

I work with a data set of 171 observations of 55 variables with 35 variables having NA's that I want to impute with the mice function: imp_Data <- mice(Data,m=5,maxit=50,meth='pmm',seed=500) imp_Data$imp Now, having the 5 imputation runs, I don't…
Marie-Lu
  • 41
  • 2
4
votes
2 answers

Forward fill column with an index-based limit

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index---not a simple number of rows like limit allows. For example, say I have the dataframe given by: df = pd.DataFrame({ 'data': [0.0, 1.0,…
alkasm
  • 22,094
  • 5
  • 78
  • 94
4
votes
1 answer

How to use cross validation after imputing on a training and validation set?

So I've gotten myself a little confused. At the moment, I've got a dataset of about 800 instances. I've split it into a training and validation set because there were missing values so I used SimpleImputer from sklearn and fit_transform-ed the…
Alexia M
  • 41
  • 1
4
votes
3 answers

Imputer on some columns in a Dataframe

I am trying to use Imputer on a single column called age to replace missing values. But I get the error "Expected 2D array, got 1D array instead:". Following is my code: import pandas as pd import numpy as np from sklearn.preprocessing import…
Mitesh
  • 43
  • 1
  • 3