
I have an R data frame with data from multiple subjects, each tested several times. To perform statistics on the set, there is a factor for subject ("id") and a row for each observation (around 40,000) with around 200 variables each.

allData <- data.frame(id       = rep(1:4, 3),
                      session  = rep(1:3, each = 4),
                      measure1 = sample(c(NA, 1:11)),
                      measure2 = sample(c(NA, 1:11)),
                      measure3 = sample(c(NA, 1:11)),
                      measure4 = sample(c(NA, 1:11)))
allData                      
#    id session measure1 measure2 measure3 measure4
# 1   1       1        3        7       10        6
# 2   2       1        4        4        9        9
# 3   3       1        6        6        7       10
# 4   4       1        1        5        2        3
# 5   1       2       NA       NA        5       11
# 6   2       2        7       10        6        5
# 7   3       2        9        8        4        2
# 8   4       2        2        9        1        7
# 9   1       3        5        1        3        8
# 10  2       3        8        3        8        1
# 11  3       3       11       11       11        4
# 12  4       3       10        2       NA       NA

I need to remove every row belonging to an id that has an NA in at least one of the "measureX" columns (X = 1, ..., 4). In the example above, ids 1 and 4 each have a row containing NA, so all rows with id 1 or 4 should be removed.

A solution for this problem was suggested by flodel in [https://stackoverflow.com/a/9917524/5042101][1] using the "plyr" package and the function ddply.

probeColumns = c('measure1','measure4')

library(plyr)
ddply(allData, "id",
      function(df) if (any(is.na(df[, probeColumns]))) NULL else df)

Problem: my data set has around 40,000 rows and 200 columns. Even when I try this with a single column, I get an error: C stack usage 10027284.

I am using R 3.1.3 in RStudio on Windows. When I try with more columns, RStudio closes automatically or R freezes. Moreover, I do not have administrator access on the computer.

Vladimir AC

2 Answers


I can't say exactly what the problem is with plyr (though it might be a bug in the package). It is possible to do this using apply:

> allData[apply(allData, 1, function(x) !any(is.na(x[probeColumns]))), ]
   id session measure1 measure2 measure3 measure4
1   1       1        1        1        2        4
2   2       1        5        4        6        1
3   3       1        9        8       NA        3
4   4       1       11        7        7        5
5   1       2        8        5       11        2
6   2       2        6       NA        5        8
7   3       2       10       10        3       10
9   1       3        4        9        4        9
10  2       3        2        6        8        7
11  3       3        3        3        9        6

A bit of explanation: apply(allData, c(1), function(x) !any(is.na(x[probeColumns]))) goes row by row and checks whether any of the values in the probeColumns columns are NA. The result is a logical vector marking the rows that have no NA in those columns, which is then used to subset allData.
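Note that the row-wise filter above drops only the individual rows containing NA. If every row of an affected id should go, as the question asks, one possible vectorized variant in base R is sketched below. It is not part of the original answer. The example data is copied from the question's printed output, and the names hasNA, badIds, and cleaned are introduced here for illustration only.

```r
# Example data, typed in from the output shown in the question
allData <- data.frame(
  id       = rep(1:4, 3),
  session  = rep(1:3, each = 4),
  measure1 = c(3, 4, 6, 1, NA, 7, 9, 2, 5, 8, 11, 10),
  measure2 = c(7, 4, 6, 5, NA, 10, 8, 9, 1, 3, 11, 2),
  measure3 = c(10, 9, 7, 2, 5, 6, 4, 1, 3, 8, 11, NA),
  measure4 = c(6, 9, 10, 3, 11, 5, 2, 7, 8, 1, 4, NA)
)
probeColumns <- c("measure1", "measure4")

# TRUE for rows with an NA in at least one probe column
hasNA <- !complete.cases(allData[, probeColumns, drop = FALSE])

# ids that own at least one such row (hypothetical helper name)
badIds <- unique(allData$id[hasNA])

# Keep only the rows whose id was never flagged
cleaned <- allData[!allData$id %in% badIds, ]
cleaned
```

Because complete.cases() and %in% work on whole columns at once, this avoids calling a function once per row or per group, which may help with 40,000 rows, although that has not been measured here.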

romants

Here is my solution. It may be a little clumsy, but the idea is:

  1. Find where the NAs are located.
  2. Identify which ids they belong to.
  3. Remove all rows for every id that has an NA in at least one column.

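    library(dplyr)  # provides %>% and filter()

    # ids of rows with at least one NA; "> 0" also catches rows with several NAs
    ind <- allData[apply(allData, 1, function(x) sum(is.na(x))) > 0, 1]
    
    allData %>% filter(!id %in% ind)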
      id session measure1 measure2 measure3 measure4
    1  1       1        1        6        1        8
    2  2       1       10        2        7        2
    3  1       2       11        7        5       11
    4  2       2        5        5        4        7
    5  1       3        4        8        9        5
    6  2       3        8       11        3        9
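The three steps can also be sketched in base R without apply() or dplyr, by counting NAs per row and then summing those counts within each id. This is only an illustration under assumptions: the toy data frame and the names naPerRow, naPerId, and cleaned are hypothetical and not taken from the answer.

```r
# Hypothetical toy data: id 1 has a row with two NAs, id 3 has one NA
toy <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  a  = c(1, NA, 3, 4, 5, 6),
  b  = c(NA, NA, 1, 2, 3, NA)
)

# Step 1: number of NAs in each row
naPerRow <- rowSums(is.na(toy))

# Step 2: total number of NAs for the id each row belongs to
naPerId <- ave(naPerRow, toy$id, FUN = sum)

# Step 3: keep only rows whose id has no NA anywhere
cleaned <- toy[naPerId == 0, ]
cleaned
```

Only the two rows of id 2 remain. Comparing the count with 0 rather than with a logical value matters for rows like the second one, which contain more than one NA.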
    
SabDeM