Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g., tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet observed. Handling such data requires additional annotation, as well as methodological care when deciding how to impute the missing values or use the data in standard analyses. This becomes a particular problem in data-intensive contexts, such as large statistical analyses of databases.
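
As a minimal sketch of what such annotation can look like, the Python snippet below marks missing entries with an explicit boolean mask and then applies simple mean imputation. The `readings` list and the `is_missing` helper are hypothetical names chosen for illustration, and mean imputation is only one of many possible strategies.

```python
import math

# Hypothetical measurements: None and NaN both stand for "not observed".
readings = [21.5, float("nan"), 19.0, None, 22.5]

def is_missing(value):
    """Treat None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# The mask is the "additional annotation": True where a value is missing.
mask = [is_missing(v) for v in readings]

# Use only the observed values to compute a summary statistic.
observed = [v for v, m in zip(readings, mask) if not m]
mean_observed = sum(observed) / len(observed)

# Simple mean imputation: fill each gap with the mean of the observed values.
imputed = [mean_observed if m else v for v, m in zip(readings, mask)]
print(mask)
print(imputed)
```

Keeping the mask alongside the imputed values preserves the information about which entries were originally missing, which later analyses may need.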

Missing data occur in many fields, from survey data to industrial data, and there are many underlying missing-data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because of drop-out, or because respondents ran out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under certain of these classes.
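
The three classes can be illustrated with a small simulation. The Python sketch below is only an illustration under made-up assumptions: the `ages` and `incomes` data, the `simulate_missingness` function, and the chosen probabilities are all hypothetical. It removes income values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import random

def simulate_missingness(ages, incomes, mechanism, seed=0):
    """Return a copy of incomes with some entries replaced by None."""
    rng = random.Random(seed)
    result = []
    for age, income in zip(ages, incomes):
        if mechanism == "MCAR":
            # Missing completely at random: a flat chance, unrelated to any data.
            p_missing = 0.3
        elif mechanism == "MAR":
            # Missing at random: depends only on an observed variable (age).
            p_missing = 0.6 if age < 30 else 0.1
        elif mechanism == "MNAR":
            # Missing not at random: depends on the value that goes missing.
            p_missing = 0.6 if income > 50_000 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism!r}")
        result.append(None if rng.random() < p_missing else income)
    return result

def observed_mean(values):
    """Mean of the entries that are not None."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

# Synthetic data: income loosely increases with age (an assumption for the demo).
data_rng = random.Random(42)
ages = [data_rng.randint(18, 70) for _ in range(1000)]
incomes = [20_000 + 800 * age + data_rng.gauss(0, 10_000) for age in ages]
true_mean = sum(incomes) / len(incomes)

for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = simulate_missingness(ages, incomes, mechanism)
    print(mechanism, "bias of observed mean:", round(observed_mean(masked) - true_mean))
```

In this setup the mean of the observed values is biased downward under the MNAR mechanism, because high incomes are the ones most likely to be removed. This is the kind of situation in which an analysis that ignores the missingness mechanism can be invalid.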

2809 questions
16
votes
4 answers

Exporting ints with missing values to csv in Pandas

When saving a Pandas DataFrame to csv, some integers are getting converted to floats. It happens where a column of floats has missing values (np.nan). Is there a simple way to avoid it? (Especially in an automatic way - I often deal with many…
Piotr Migdal
15
votes
4 answers

Replace a string value with NaN in pandas data frame - Python

Do I have to replace the value ? with NaN so that I can invoke the .isnull() method? I have found several solutions but some errors are always returned. Suppose: data = pd.DataFrame([[1,?,5],[?,?,4],[?,32.1,1]]) and if I try: pd.data.replace('?',…
stefanodv
15
votes
3 answers

Dataset in base R with missing values

Are there any examples of datasets in base R that contain missing values? I've been looking through each one in turn and also searched using Google, with nothing so far. library(MASS) data() Edit: I know how to add missing values to a dataset in R, I…
John_dydx
15
votes
2 answers

Why is there no NA_logical_

From help("NA"): There are also constants NA_integer_, NA_real_, NA_complex_ and NA_character_ of the other atomic vector types which support missing values: all of these are reserved words in the R language. My question is why there is no…
Ari B. Friedman
15
votes
3 answers

Specify different types of missing values (NAs)

I'm interested in specifying types of missing values. I have data with different types of missingness and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them. Say I have some data…
Eric Fail
14
votes
6 answers

Python: create a new column from existing columns

I am trying to create a new column based on both columns. Say I want to create a new column z, and it should be the value of y when it is not missing and be the value of x when y is indeed missing. So in this case, I expect z to be [1, 8, 10, 8]. …
Kexin Xu
14
votes
8 answers

Coalesce two string columns with alternating missing values to one

I have a data frame with two columns "a" and "b" with alternating missing values (NA) a b dog mouse cat bird I want to "merge" / combine them to a new column c that looks like this, i.e. the non-NA element in each…
ben_aaron
13
votes
2 answers

Preserve NaN values in pandas boolean comparisons

I have two boolean columns A and B in a pandas dataframe, each with missing data (represented by NaN). What I want is to do an AND operation on the two columns, but I want the resulting boolean column to be NaN if either of the original columns is…
Will Bryant
13
votes
3 answers

Pandas: groupby forward fill with datetime index

I have a dataset that has two columns: company, and value. It has a datetime index, which contains duplicates (on the same day, different companies have different values). The values have missing data, so I want to forward fill the missing data with…
sapo_cosmico
13
votes
7 answers

Filling missing data by random choosing from non missing values in pandas dataframe

I have a pandas data frame in which there are several missing values. I noticed that the non-missing values are close to each other. Thus, I would like to impute the missing values by randomly choosing from the non-missing values. For instance: import…
Donald Gedeon
13
votes
7 answers

Randomly insert NAs into dataframe proportionally

I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data. A <- c(1:10) B <- c(11:20) C <- c(21:30) df<- data.frame(A,B,C) Can anyone suggest a quick way of doing that?
Filly
13
votes
2 answers

Pandas rolling apply with missing data

I want to do a rolling computation on missing data. Sample Code: (For sake of simplicity I'm giving an example of a rolling sum but I want to do something more generic.) foo = lambda z: z[pandas.notnull(z)].sum() x = np.arange(10, dtype="float") …
Mahesh
12
votes
6 answers

Filling missing levels

I have the following type of dataframe: Country <- rep(c("USA", "AUS", "GRC"),2) Year <- 2001:2006 Level <- c("rich","middle","poor",rep(NA,3)) df <- data.frame(Country, Year,Level) df Country Year Level 1 USA 2001 rich 2 AUS 2002…
msh855
12
votes
9 answers

How to remove columns with too many missing values in Python

I'm working on a machine learning problem in which there are many missing values in the features. There are hundreds of features and I would like to remove those features that have too many missing values (it can be features with more than 80% missing…
HHH
12
votes
2 answers

Pandas: How to fill null values with mean of a groupby?

I have a dataset with some missing data that looks like this: id category value 1 A NaN 2 B NaN 3 A 10.5 4 C NaN 5 A 2.0 6 B 1.0 I need to fill in the…
sfactor