How can I generate missing data structures to run simulations on high dimensional data in R?

Question

In R, I generate a high-dimensional dataset with the code below, and I want to create incomplete versions of it under the MAR, MCAR and MNAR mechanisms, with 5%, 25% and 40% missing rates:

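library(MASS)  # provides mvrnorm()

generateData <- function(n, p) {
  # Equicorrelated covariance: 1 on the diagonal, 0.3 elsewhere
  sigma <- diag(p)
  sigma <- replace(sigma, sigma == 0, 0.3)
  mu <- rep(0, nrow(sigma))
  X <- mvrnorm(n, mu = mu, Sigma = sigma)
  # Binary response from a logistic model with random coefficients
  vCoef <- rnorm(ncol(X))
  vProb <- exp(X %*% vCoef) / (1 + exp(X %*% vCoef))
  Y <- rbinom(nrow(X), 1, vProb)
  data <- data.frame(cbind(X, Y))
  return(data)
}
data <- generateData(n = 100, p = 120)
X <- data[-ncol(data)]  # predictors
Y <- data[ncol(data)]   # response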

Next I will compare the performance of imputation methods. I tried using the ampute function to generate missing datasets but when I run the code I get the following error, which I think is related to pattern and weight:

result <- ampute(X, prop = 0.4, mech ='MAR', type="RIGHT", bycases=FALSE)
Error: Proportion of missing cells is too large in combination with the desired number of missing variables

When using the ampute function, I cannot work out the right settings for the patterns and weights. I tried various pattern and weight values for MAR, MCAR and MNAR, but none of them worked. I also don't know whether missingness should be generated on all of the variables or only on some of them (for example, the first 50). As imputation methods, I will use EM, KNN, random forests, regression-based methods, naive Bayes and artificial neural networks, as well as classical methods. Can I do this by adjusting the arguments of the ampute function, or should I use another function? Thanks in advance for your help.

Matteo Pedone · Accepted Answer · 2021-11-19T10:31:18.410

Since you’re editing some missing cross-references, I deleted my old answer (which should have been a comment instead) and am trying to be complete and summarize my answer here.

I think the problem here is due to a misuse of the argument bycases. If it is set to FALSE, the prop argument defines the proportion of missing entries (cells) in your data frame. If you set prop = .4, given the dimension of your data frame (100 x 120 = 12,000 entries) and the default pattern (where the missingness is on one variable only), you are asking for a data frame with 4,800 missing values, all on one variable (which has only 100 entries).
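
To make the arithmetic behind the error concrete, here is a minimal sketch in base R. It does not call `ampute`; it only counts cells, assuming the 100 x 120 predictor matrix from the question and the default pattern with one missing variable per incomplete row. The variable names are illustrative.

```r
n <- 100           # observations (rows)
p <- 120           # predictors (columns)
prop_cells <- 0.4  # requested proportion of missing cells (bycases = FALSE)

total_cells   <- n * p                      # 12000 cells in X
missing_cells <- prop_cells * total_cells   # 4800 cells requested as missing

# Assumption: with the default pattern, an incomplete row carries exactly
# one missing cell, so this many rows would have to be incomplete.
incomplete_rows_needed <- missing_cells / 1

# TRUE means the request cannot be met with only n rows available
incomplete_rows_needed > n
```

The request needs far more incomplete rows than the data frame has, which is consistent with the error message.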

If you instead define the proportion of missingness in terms of cases (the default, bycases = TRUE),

data <- generateData(n = 100, p=120)
X <- data[-ncol(data)]
Y <- data[ncol(data)]

result2 <- ampute(X, prop = 0.4)
result2$prop
#[1] 0.4

no error occurs, since you are asking for 40 observations (out of 100) to have a missing value on one variable (we are still using the default pattern).

If you want to use bycases = FALSE, you should either define a pattern that induces missingness on more than one variable, or set a proportion such that the number of missing values for a single covariate is less than the number of observations:

result3 <- ampute(X, prop = 0.0075, bycases = FALSE)
result3$prop
#[1] 0.9

## that is, 120 x 100 x 0.0075 = 90 missing cells, fewer than the 100 observations
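
The same bookkeeping suggests how many variables a pattern needs before a given cell-wise proportion becomes feasible. The sketch below is plain arithmetic, not `ampute`'s internal code. The helpers `rows_needed()` and `max_prop_cells()` are hypothetical names (they are not part of mice), and the sketch assumes every pattern sets the same number `k` of variables to missing.

```r
# Hypothetical helper: number of incomplete rows required to reach a
# cell-wise missing proportion when each incomplete row has k missing cells.
rows_needed <- function(prop_cells, n, p, k) {
  prop_cells * n * p / k
}

# Hypothetical helper: largest cell-wise proportion that still fits,
# i.e. the value at which every one of the n rows is incomplete.
max_prop_cells <- function(p, k) {
  k / p
}

rows_needed(0.0075, n = 100, p = 120, k = 1)  # about 90 rows, below n = 100
max_prop_cells(p = 120, k = 1)                # 1/120, so 0.4 is far too large
rows_needed(0.4, n = 100, p = 120, k = 50)    # about 96 rows, also below n = 100
```

Under these assumptions, prop = 0.4 with bycases = FALSE would need patterns with at least 0.4 x 120 = 48 missing variables each.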

Here I report a simple script to generate the dataset you need.

rm(list=ls())
library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
set.seed(221)

n <- 100
P <- 120
pstar <- 50
covmat <- toeplitz((P:1)/P)
npat <- 120

testdata <- MASS::mvrnorm(n = n, mu = rep(0, P), Sigma = covmat)
testdata <- as.data.frame(testdata)

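myfreq <- .15 # target per-variable missingness; also try .05, .25
# Each pattern sets a random subset of the first pstar variables to 0 (= missing)
mypatterns <- matrix(1, nrow = npat, ncol = P)
for (i in seq_len(npat)) {
  idx <- sample(x = 1:pstar, size = myfreq * n, replace = FALSE)
  mypatterns[i, idx] <- 0
}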
#mypatterns

result <- ampute(testdata, patterns = mypatterns)
md.pattern(result$amp)

Created on 2021-11-19 by the reprex package (v2.0.1)
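
A rough way to see why the script above gives about 15% missingness per variable: a row is made incomplete with probability `prop`, and a given variable among the first `pstar` is then missing with probability `k / pstar`, where `k = myfreq * n` is the number of zeros per pattern. The sketch below works this out in base R. It assumes `ampute`'s default `prop = 0.5` and equally frequent patterns, and the simulation is a simplified stand-in rather than a call to `ampute` (it ignores the weighted scores of the MAR mechanism). The names `prop_cases`, `k` and `k_needed` are illustrative.

```r
set.seed(1)
n <- 100; P <- 120; pstar <- 50; npat <- 120
prop_cases <- 0.5   # assumed: ampute's default `prop`, left unchanged in the script
k <- 15             # zeros per pattern, i.e. myfreq * n with myfreq = .15

# Expected per-variable missingness among the first pstar variables
expected_rate <- prop_cases * k / pstar

# Stand-in simulation: random patterns, rows incomplete with prob. prop_cases,
# each incomplete row gets one pattern drawn uniformly at random
patterns <- matrix(1, nrow = npat, ncol = P)
for (i in seq_len(npat)) patterns[i, sample(pstar, k)] <- 0

n_sim <- 20000
incomplete <- runif(n_sim) < prop_cases
pat_id <- sample(npat, n_sim, replace = TRUE)
miss <- patterns[pat_id, ] == 0 & incomplete   # TRUE where a cell is missing
rate_all <- colMeans(miss)
sim_rate <- rate_all[1:pstar]

expected_rate      # 0.15
mean(sim_rate)     # close to expected_rate
range(sim_rate)    # single variables scatter around it

# Zeros per pattern needed for other target rates under the same assumptions
target <- c(0.05, 0.25, 0.40)
k_needed <- round(target * pstar / prop_cases)  # 5, 25, 40
```

Individual variables deviate from 15% because the randomly drawn patterns do not cover every variable equally often, and in a single data set of 100 rows the assignment of rows to patterns adds further noise.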

Session info

sessioninfo::session_info()
#> ─ Session info  ──────────────────────────────────────────────────────────────
#>  hash: person bowing, person taking bath, vampire: medium-dark skin tone
#> 
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Ubuntu 20.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  it_IT.UTF-8
#>  ctype    it_IT.UTF-8
#>  tz       Europe/Rome
#>  date     2021-11-19
#>  pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.3.0   2021-10-27 [1] CRAN (R 4.1.0)
#>  broom         0.7.10  2021-10-31 [1] CRAN (R 4.1.0)
#>  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.0)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.0)
#>  curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
#>  dplyr         1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.1   2021-10-25 [1] CRAN (R 4.1.0)
#>  glue          1.5.0   2021-11-07 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
#>  lattice       0.20-44 2021-05-02 [4] CRAN (R 4.1.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  MASS          7.3-54  2021-05-03 [4] CRAN (R 4.0.5)
#>  mice        * 3.13.0  2021-01-27 [1] CRAN (R 4.1.0)
#>  mime          0.12    2021-09-28 [1] CRAN (R 4.1.0)
#>  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.2.1   2021-11-02 [1] CRAN (R 4.1.0)
#>  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.0)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.0)
#>  tidyr         1.1.4   2021-09-27 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.28    2021-11-04 [1] CRAN (R 4.1.0)
#>  xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#>  [1] /home/matt/R/x86_64-pc-linux-gnu-library/4.1
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Thanks. Your answer was very helpful for me to see my mistake. What I want to do is create 5%, 15%, 25% missing data per variable. If 50 out of 120 variables will contain missing observations, I want to use these ratios for each of these 50 variables. I'm having a hard time creating a suitable pattern for this. Maybe I'll have to use another function besides the ampute. — Bugra Varol, Nov 18 '21 at 06:29
I think the [documentation](https://rianneschouten.github.io/mice_ampute/vignette/ampute.html) is really clear and you can easily construct your dataset using the `ampute` function. I modified my answer adding a reprex as an example. — Matteo Pedone, Nov 19 '21 at 10:34
Yes, a Toeplitz matrix is a matrix in which each descending diagonal from left to right is constant. This is an arbitrary choice! I am glad you found it helpful, if this solves your problem you can accept it as an answer and/or upvote the answer, as you have already done, I think! — Matteo Pedone, Nov 22 '21 at 07:46
I have bothered you a lot, but, with my apologies, I want to ask one last thing: We randomly assigned 15 missing values out of 50 variables by size = myfreq * n for each of the 120 observations, but I could not work out mathematically how we arrive at the 15% missing rate per variable after applying ampute. In addition, the missingness rate here is approximately 15%, not exactly 15%. I guess it is not possible to arrange this pattern structure in such a way that it will remove 15 out of 100 observations per variable precisely. — Bugra Varol, Nov 22 '21 at 09:43
_We randomly assigned ... for each of the 120 observations_ Actually, we are working on patterns. We created 120 patterns, and each of them sets 15 of the first 50 variables (out of 120) to missing. Each of those variables will have roughly 15% of the observations missing. _15 out of 100 observations per variable precisely_ I don't think you can get exactly 15%, due to the randomness in the multivariate amputation procedure implemented in `ampute`. I suggest you read [here](https://rianneschouten.github.io/mice_ampute/vignette/ampute.html#multivariate_amputation) and the references therein. — Matteo Pedone, Nov 22 '21 at 11:28
Thanks again for your useful information. You've helped me a lot. — Bugra Varol, Nov 22 '21 at 14:40
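
As a small illustration of the Toeplitz covariance mentioned in the comments, the sketch below builds the same kind of matrix as the script, but with P = 4 instead of 120 so that it can be printed. It uses only base R (`toeplitz()` from the stats package); the checks at the end are there for illustration.

```r
P <- 4
covmat <- toeplitz((P:1) / P)   # first row: 1, 0.75, 0.5, 0.25
covmat

# Every descending diagonal is constant, e.g. the first off-diagonal:
first_off_diag <- covmat[row(covmat) == col(covmat) + 1]

# A covariance matrix passed to mvrnorm() must be symmetric and
# positive (semi-)definite; check both for this small example.
is_symmetric <- isSymmetric(covmat)
min_eigen <- min(eigen(covmat, symmetric = TRUE, only.values = TRUE)$values)
```

Here the correlation between two variables decays linearly with the distance between their column indices, which is one arbitrary but convenient choice for simulated data.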

score 0 · Answer 2 · answered Nov 16 '21 at 14:15

I think you should provide an MWE (minimal working example) so that your error can be reproduced. Since I don't know what mypattern looks like, I can't help you!

Thanks for the answer! Even though I tried different patterns, I got the same error every time. For example, I tried a pattern like the one below to create missing data in the first 50 variables:

############### PATTERN ###########################
a <- ncol(X)
b <- 50
mypattern <- matrix(rep(1, a*b), ncol=a, nrow=b)
for(i in 1:b) {
  mypattern[i,i] <- 0
}

The problem also persisted when I used the default pattern. I have updated my code above to use the default pattern.
