Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet observed. Handling such values requires additional annotation to mark them as missing, as well as methodological care when deciding how to impute them or use them in standard analyses. The problem is especially pressing in data-intensive contexts, such as large statistical analyses of databases.
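As a rough illustration of the annotation and imputation choices described above, here is a minimal Python sketch using only the standard library. The toy table, its column names, and the use of `None` as the missing-value marker are all assumptions made for the example; real projects would more often rely on a library's own missing-value type.

```python
from statistics import mean

# Hypothetical toy table: None marks a value that was not observed.
rows = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": None, "income": 48000.0},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 41, "income": 61000.0},
]
columns = ["age", "income"]

# Annotation: a boolean mask with the same shape as the data (True = missing).
mask = [{c: row[c] is None for c in columns} for row in rows]
missing_per_column = {c: sum(m[c] for m in mask) for c in columns}

# Option 1: listwise deletion keeps only the fully observed rows.
complete_rows = [r for r, m in zip(rows, mask) if not any(m.values())]

# Option 2: mean imputation fills each gap with the observed column mean.
col_means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
imputed = [
    {**r, **{c: col_means[c] for c in columns if r[c] is None}}
    for r in rows
]

print(missing_per_column)
print([r["id"] for r in complete_rows])
print(imputed)
```

Keeping the mask separate from the data means the information about which values were originally missing survives imputation. Whether deletion or mean imputation is appropriate depends on why the values are missing.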

Missing data occur in many fields, from survey data to industrial data. There are many possible missing data mechanisms (the reasons why data are missing). In survey data, for example, values might be missing because of drop-out or because respondents ran out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under certain of these classes.
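The three classes can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings: the variables `x` and `y`, the missingness probabilities, and the helper names `apply_mask` and `observed_mean` are all hypothetical choices, not part of any standard definition. Here `x` is always observed and `y` is the variable that goes missing.

```python
import random
from statistics import mean

random.seed(0)
n = 20_000

# Hypothetical data: x is always observed; y depends on x and may go missing.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]


def apply_mask(values, is_missing):
    """Replace entries flagged as missing with None."""
    return [None if m else v for v, m in zip(values, is_missing)]


def observed_mean(values):
    """Mean of the entries that are not missing."""
    return mean(v for v in values if v is not None)


# Missing completely at random: missingness ignores both x and y.
y_mcar = apply_mask(y, [random.random() < 0.3 for _ in range(n)])

# Missing at random: missingness depends only on the observed variable x.
y_mar = apply_mask(y, [random.random() < (0.6 if xi > 0 else 0.1) for xi in x])

# Missing not at random: missingness depends on the unobserved value of y itself.
y_mnar = apply_mask(y, [random.random() < (0.6 if yi > 0 else 0.1) for yi in y])

print("full data mean:", mean(y))
print("MCAR observed mean:", observed_mean(y_mcar))
print("MAR observed mean:", observed_mean(y_mar))
print("MNAR observed mean:", observed_mean(y_mnar))
```

In this setup the naive mean of the observed values stays close to the full-data mean only in the first case. In the second case the bias can in principle be corrected using `x`, because `x` fully explains the missingness; in the third case the observed data alone do not contain enough information to correct it.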

2809 questions
12
votes
2 answers

'NaTType' object has no attribute 'days'

I have a column in my dataset which represents a date in ms and sometimes its value is nan (actually my column is of type str and sometimes its value is 'nan'). I want to compute the epoch in days of this column. The problem is that when doing the…
Ruggero Turra
  • 16,929
  • 16
  • 85
  • 141
12
votes
2 answers

NA in clustering functions (kmeans, pam, clara). How to associate clusters to original data?

I need to cluster some data and I tried kmeans, pam, and clara with R. The problem is that my data are in a column of a data frame and contain NAs. I used na.omit() to get my clusters. But then how can I associate them with the original data? The…
Bakaburg
  • 3,165
  • 4
  • 32
  • 64
12
votes
4 answers

visual structure of a data.frame: locations of NAs and much more

I want to represent the structure of a data frame (or matrix, or data.table whatever) on a single plot with color-coding. I guess that could be very useful for many people handling various types of data, to visualize it in a single glance. Perhaps…
agenis
  • 8,069
  • 5
  • 53
  • 102
12
votes
2 answers

Identifying rows in data.frame with only NA values in R

I have a data.frame with 15,000 observations of 34 ordinal and NA variables. I am performing clustering for a market segmentation study and need the rows with only NAs removed. After taking out the userID I got an error message saying to omit 2099…
Scott Davis
  • 983
  • 6
  • 22
  • 43
12
votes
1 answer

Return FALSE for duplicated NA values when using the function duplicated()

just wondering why duplicated behaves the way it does with NAs: > duplicated(c(NA,NA,NA,1,2,2)) [1] FALSE TRUE TRUE FALSE FALSE TRUE where in fact > NA == NA [1] NA is there a way to achieve that duplicated marks NAs as false, like this? >…
jamborta
  • 5,130
  • 6
  • 35
  • 55
11
votes
1 answer

Efficient handling of sparsely missing data in Haskell

I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.…
Bilal Barakat
  • 1,405
  • 2
  • 9
  • 11
11
votes
3 answers

How do I deal with NAs in residuals in a regression in R?

So I am having some issues with some NA values in the residuals of a lm cross sectional regression in R. The issue isn't the NA values themselves, it's the way R presents them. For example: test$residuals # 1 2 4 …
c00kiemonster
  • 22,241
  • 34
  • 95
  • 133
11
votes
2 answers

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C). I also have a continuous variable with some missing values in it. I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be…
JOG
  • 113
  • 1
  • 5
11
votes
1 answer

Elasticsearch match phrase prefix not matching all terms

I am having an issue where when I use the match_phrase_prefix query in Elasticsearch, it is not returning all the results I would expect it to, particularly when the query is one word followed by one letter. Take this index mapping (this is a…
Paul T Davies
  • 2,527
  • 2
  • 22
  • 39
11
votes
4 answers

Function to change blanks to NA

I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this: a b 12 210 468 I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to…
Travis Heeter
  • 13,002
  • 13
  • 87
  • 129
11
votes
4 answers

Imputer on some Dataframe columns in Python

I am learning how to use Imputer on Python. This is my code: df=pd.DataFrame([["XXL", 8, "black", "class 1", 22], ["L", np.nan, "gray", "class 2", 20], ["XL", 10, "blue", "class 2", 19], ["M", np.nan, "orange", "class 1", 17], ["M", 11, "green",…
Mauro Gentile
  • 1,463
  • 6
  • 26
  • 37
11
votes
3 answers

xgboost: handling of missing values for split candidate search

in section 3.4 of their article, the authors explain how they handle missing values when searching the best candidate split for tree growing. Specifically, they create a default direction for those nodes with, as splitting feature, one with missing…
pmarini
  • 121
  • 1
  • 1
  • 6
11
votes
3 answers

Find entities with missing attributes in Datomic

If I have the following Datomic database: { :fred :age 42 } { :fred :likes :pizza } { :sally :age 42 } How do I query for both entities (:fred and :sally), getting back the attribute :likes :pizza for :fred and an empty value for :sally? The…
Ralph
  • 31,584
  • 38
  • 145
  • 282
11
votes
5 answers

String format with optional dict key-value

Is there any way to format string with dict but optionally without key errors? This works fine: opening_line = '%(greetings)s %(name)s !!!' opening_line % {'greetings': 'hello', 'name': 'john'} But let's say I don't know the name, and I would like…
Nikhil Rupanawar
  • 4,061
  • 10
  • 35
  • 51
11
votes
3 answers

R package caret confusionMatrix with missing categories

I am using the function confusionMatrix in the R package caret to calculate some statistics for some data I have. I have been putting my predictions as well as my actual values into the table function to get the table to be used in the…
Barker
  • 2,074
  • 2
  • 17
  • 31