Failing to ignore NAs in my list of files

Question

I have a list of files (from 1 to 332) inside my directory. The file1 corresponds to id1, and the file2 corresponds to id2, and so on and so forth.

Each file contains 4 columns, and I have to calculate the sums and lengths of the 2nd column (labelled as "pollutant") while ignoring the NAs.

I have tried everything: !is.na(file), na.rm = TRUE, omit... It works when I want the sum and length from 1:100 or 1:60 (from the value 1 to another value), but it doesn't work from 70:72, for instance. I can't pinpoint the problem.

Here is the part of my code that deals with it:

pollutantmean <- function(directory,pollutant,id= 1:332){

  files <- list.files(directory)
  sums <- numeric (length(id))
  lengths <- numeric (length(id))
  means <- numeric (length(id))

  for (i in id){

      file <- read.csv(files[i])[,pollutant]
      sums[i] <- sum(file,na.rm = TRUE)
      lengths[i] <-length(file[!is.na(file)])
  }

  means <-(sum(sums)/sum(lengths))
  return(list(sums, lengths, means))

}

Thanks in advance for your help!

Would it be possible to share a code snippet on which the above fails? — Edwin, May 02 '17 at 10:26
@Edwin: I edited my question to include the entire code above. — Kathia, May 02 '17 at 12:37
@jogo Yes, I want means to be a single value. When I run the script by typing: pollutantmean(".","sulf",1:10) I get the correct value of means. However, when I type: pollutantmean(".","sulf",70:72), I get the answer "NA" — Kathia, May 02 '17 at 13:42
o.k. But why do you do the initialisation `means <- numeric (length(id))` ? For the other problem: please supply data so that we can reproduce the issue! Edit your question: http://stackoverflow.com/posts/43735470/edit In its current state your question is off-topic on SO. http://stackoverflow.com/help/closed-questions — jogo, May 02 '17 at 18:14
Your indexing is wrong. When you call `pollutantmean(".","sulf",70:72)`, what is the value of `length(id)` in the function `pollutantmean` ? ... and for the first value of `i` in the loop `for (i in id)` what index is it? — jogo, May 02 '17 at 18:34

score 1 · Accepted Answer · edited May 23 '17 at 12:02

Your indexing is wrong. When you call pollutantmean(".","sulf",70:72), what is the value of length(id) in the function pollutantmean? (answer: 3) ... and for the first value of i in the loop for (i in id) what index is it? (answer: 70)
Here is an example of what you are doing and what you get with the wrong indexing:

sums <- numeric(3)
sums[10] <- 42
sums
# > sums
# [1]  0  0  0 NA NA NA NA NA NA 42

... the further calculations give NA
So, the origin of the problem is the same as in your other question
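
The smallest change to the original loop is to index the result vectors by position within id rather than by the id value itself. The following is only a sketch under assumptions: `readings` is a hypothetical in-memory list standing in for the pollutant columns read from the CSV files, and the values are made up.

```r
# Hypothetical stand-in for the pollutant column of five files (made-up values).
readings <- list(c(1, NA, 3), c(NA, 5), c(2, 2, NA), c(4, NA), c(6, 7, 8))
id <- 3:5

sums    <- numeric(length(id))
lengths <- numeric(length(id))

# k runs over 1..length(id) and indexes the result vectors;
# id[k] selects which element (file) to read.
for (k in seq_along(id)) {
  x <- readings[[id[k]]]
  sums[k]    <- sum(x, na.rm = TRUE)
  lengths[k] <- sum(!is.na(x))
}

overall_mean <- sum(sums) / sum(lengths)
```

Because sums and lengths never grow beyond length(id), no NA padding is created and the final division stays numeric.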

Here is a clear version of your function:

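pollutantmean <- function(directory, pollutant, id = 1:332) {
  # full.names = TRUE so read.csv() also works when directory is not the working directory
  files <- list.files(directory, full.names = TRUE)
  # read only the requested files and keep the pollutant column of each
  L <- lapply(files[id], function(f) read.csv(f)[, pollutant])
  sums    <- sapply(L, sum, na.rm = TRUE)
  lengths <- sapply(L, function(l) sum(!is.na(l)))

  list(sums = sums, lengths = lengths, means = sum(sums) / sum(lengths))
}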
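
Since the comments asked for reproducible data, here is one way the approach could be checked without the real files. This is a sketch, not the original data: the directory name, file names and values are invented, and the column name "sulf" is taken from the calls quoted in the comments.

```r
# Build a throwaway directory with three tiny CSV files (made-up data).
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo <- list(c(1, NA, 3), c(NA, 5), c(2, 2, NA))
for (k in seq_along(demo)) {
  write.csv(data.frame(sulf = demo[[k]]),
            file.path(demo_dir, sprintf("%03d.csv", k)),
            row.names = FALSE)
}

# Same lapply/sapply approach as the function above, applied to files 2 and 3.
# full.names = TRUE gives read.csv() complete paths.
files  <- list.files(demo_dir, full.names = TRUE)
L      <- lapply(files[2:3], function(f) read.csv(f)[, "sulf"])
sums   <- sapply(L, sum, na.rm = TRUE)
counts <- sapply(L, function(v) sum(!is.na(v)))
result <- sum(sums) / sum(counts)
```

Selecting files[2:3] first and then looping over the selection means the results always have exactly as many entries as id, whatever range is requested.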

Thanks a lot @jogo ! I understood my mistake and learnt about the functions sapply and lapply. — Kathia, May 02 '17 at 20:18
