Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may never be observed, may be corrupted, or may not yet be observed. Treating such data requires additional annotation to mark which values are missing, as well as methodological care when deciding how to impute or use the data in standard contexts. This is a particular concern in data-intensive contexts, such as large statistical analyses of databases.

Missing data occur in many fields, from survey data to industrial data. There are many underlying missing data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because of drop-out, or because people answering the survey ran out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under certain of these classes.
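The three classes above can be illustrated with a small simulation. The following is only a minimal sketch in Python, not a reference implementation: the column names `age` and `income`, the helpers `make_rows` and `mask_values`, and the median-based masking rule are all hypothetical choices made for illustration. Under "missing completely at random" (MCAR) every value has the same chance of being masked; under "missing at random" (MAR) the chance depends only on an observed column (here `age`); under "missing not at random" (MNAR) it depends on the value that goes missing (here `income` itself).

```python
import random


def make_rows(n, seed=0):
    """Generate toy records; 'age' and 'income' are hypothetical columns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(18, 80)
        # Income loosely increases with age, plus noise.
        income = 20000 + 500 * age + rng.gauss(0, 5000)
        rows.append({"age": age, "income": income})
    return rows


def mask_values(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows with some 'income' values replaced by None.

    mechanism: 'MCAR', 'MAR', or 'MNAR' (illustrative rules, see below).
    """
    rng = random.Random(seed)
    age_median = sorted(r["age"] for r in rows)[len(rows) // 2]
    income_median = sorted(r["income"] for r in rows)[len(rows) // 2]
    out = []
    for r in rows:
        if mechanism == "MCAR":
            # Same masking probability for every row.
            p = rate
        elif mechanism == "MAR":
            # Probability depends only on the fully observed 'age' column.
            p = 2 * rate if r["age"] > age_median else 0.0
        elif mechanism == "MNAR":
            # Probability depends on the value that is being masked.
            p = 2 * rate if r["income"] > income_median else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism!r}")
        masked = dict(r)  # copy so the input rows are left untouched
        if rng.random() < p:
            masked["income"] = None
        out.append(masked)
    return out


def observed_mean(rows):
    """Mean of the non-missing 'income' values (complete-case estimate)."""
    values = [r["income"] for r in rows if r["income"] is not None]
    return sum(values) / len(values)


if __name__ == "__main__":
    data = make_rows(1000, seed=1)
    print("full data:", round(observed_mean(data)))
    for mech in ("MCAR", "MAR", "MNAR"):
        print(mech, round(observed_mean(mask_values(data, mech, seed=2))))
```

Running the script prints the complete-case mean of `income` under each rule next to the mean of the full data, which gives a rough feel for why an analysis that simply drops missing values may be reasonable under one class and biased under another.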

2809 questions
1
vote
3 answers

Inferring missing data with Restricted Boltzmann Machines

Similar to the Netflix competition, assume we have a movie dataset with missing ratings. How would I modify an RBM to allow it to deduce the missing values? In related papers, one straightforward way is to impute random values to the missing visible…
IssamLaradji
1
vote
2 answers

MATLAB: repeat elements in vector with increasing timestamp

I have a matrix with one column containing data (one sample per second) and another column with the timestamp in seconds. There are some seconds where the data doesn't change from the previous sample, and because of this they don't appear in the vector. I…
XanderW
1
vote
1 answer

"Error in 1:ncol(x) : argument of length 0" when using Amelia in R

I am working with panel data. I have well over 6,000 country-year observations, and have specified my Amelia imputation as follows: (CountDependentVariable, m=5, ts="year", cs="cowcode", sqrts=c("OtherCountVariable2", "OtherCount3",…
ealfons1
1
vote
1 answer

R plotting a dataset with NA Values

I'm trying to plot a dataset consisting of numbers and some NA entries in R. V1,V2,V3 2, 4, 3 NA, 5, 4 NA,NA,NA NA, 7, 3 6, 6, 9 Should return the same lines in the plot, as if I had entered: V1,V2,V3 2, 4, 3 3, 5, 4 4, 6, 3.5 5, 7, 3 6, 6,…
Stephen Rewitz
1
vote
2 answers

Stata: replace missing values with existing observations

I am trying to replace missing values with values from the same column, depending on whether other columns match: I have different firms, from different industries & countries and from different years. Below is just a small example. I would…
Franz
1
vote
2 answers

Find most recent non-missing value in a vector

I'm trying to return the most recent row in the vector with a non-missing value. For instance, given x <- c(1,2,NA,NA,3,NA,4) Then function(x) would output a list like: c(1,2,2,2,3,3,4) Very simple question, but running it with loops or brute…
canary_in_the_data_mine
1
vote
1 answer

Simulate Missing Data (i.e. Mask Data) in R to Test Imputation Accuracy

I want to determine a program's imputation accuracy using SNP genotype data, so I need to mask a portion of the SNP calls to simulate missing data. I've been testing my code on this subset of marker data (see below). Column names are names of…
1
vote
1 answer

How do I detect and re-insert missing data?

I have a missing row in a data table which describes a function from time, sid, and s.c to count: > dates.dt[1001:1011] sid s.c count time 1: missing CLICK 104192 2013-05-25 10:00:00 2: missing SHARE 7694 2013-05-25…
sds
1
vote
1 answer

How can I split a multiply imputed dataset created in Amelia?

I have imputed missing values using Amelia thereby creating 5 multiply imputed datasets. Now, I would like to split this multi-dataset, e.g. one set for year => 1990 and one set for year =<1990. Any ideas how I can do so? Many…
TiF
1
vote
2 answers

Replaced stars with NA has no effect when data is read from a function in R

I have a data frame where the missing values are denoted with a star sign "*". I have replaced them with > mydata[mydata == "*"] <- NA but when I use str(mydata) it shows that the missing values are still "*". Like 'data.frame': 117 obs. of 8…
ilhan
1
vote
1 answer

Calculate variance of frequencies when dataset does not contain entries of frequency zero

I have a dataset that has three fields: id, feature and frequency. What I want to do is find out, for a group of given id's, which feature has the largest spread of frequencies. The result I want is that if I split the group of id's into two…
user21037
1
vote
2 answers

Handling missing values when calculating matching distances with the 'proxy' package

I have a function that calculates simple matching distances in a matrix with ordinal data: require(proxy) m <- test f <- function(x,y) sum(x == y) / NROW(x) matches <- as.matrix(dist(m, f, upper=TRUE)) The problem is that this function won't work…
Werner Hertzog
1
vote
2 answers

How to change na.action for zero-inflated regression model?

I am running a zero-inflated negative binomial regression model using the function zeroinfl from the pscl package. I need to exclude NA's from the model in order to be able to plot the residuals against the dependent variable later in the analysis.…
Annerose N
1
vote
1 answer

SharpPCap missing packets

I'm using SharpPCap to collect IEC61850-9-2LE Sampled Values over Ethernet. IEC61850-9-2LE Sampled Values consist of several streams, each one sending 4000 packets per second, where the average packet size is 125 bytes. Using SharpPCap I'm trying to…
Lorenzo Santoro
1
vote
1 answer

Extract x-axis value using y-axis data in R

I have a time-series dataset in this format: Time Val1 Val2 0 0.68 0.39 30 0.08 0.14 35 0.12 0.07 40 0.17 0.28 45 0.35 0.31 50 0.14 0.45 100 1.01 1.31 105 0.40 1.20 110 2.02 0.57 115 1.51 0.58 130…