Calculating a mean from data held in multiple files

Question

I am trying to write an R script that calculates the mean of a specified pollutant (nitrate or sulfate) based on data from one or more of 332 monitor stations. The data from each station is held in a separate file, numbered 1:332. I am new to R and, to be fair to anyone who chooses to help me, I should say that this is a homework problem. I have written the script below, which works for just one file:

pollutantmean <- function(directory, pollutant, id = 1:332) {
    filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"
    for(i in seq_along(id)) {
            if(id < 10) {
                    name <- paste("00", id[i], sep = "")
            }
            if(id >= 10 && id < 100) {
                    name <- paste("0", id[i], sep = "")
            } 
            if(id >= 100) {
                    name <- id[i]
            }    
    }
    file <- paste(name, "csv", sep = ".")
    station <- paste(filepath, directory, file, sep = "/")
    monitor <- read.csv(station)
    if(pollutant == "nitrate") {
            x <- mean(monitor$nitrate, na.rm = T)
    }
    if(pollutant == "sulfate") {
            x <- mean(monitor$sulfate, na.rm = T)
    }
    x
}

However, if I enter more than one file (eg 70:72) I get the mean for the last file only (72). This suggests to me that it is calculating the mean for each file and then overwriting it with the mean of the next, so that only the last is outputted. I would be able to solve this using rbind(), but I can't figure out how to assign unique names for each variable which would then become the arguments for rbind(). I would be grateful for any help anyone can offer. Cheers, Jim

http://stackoverflow.com/questions/23640594/reading-multiple-files-and-calculating-mean-based-on-user-input — user227710, Jun 13 '15 at 21:29
Thank you for your help, Julien. You have given me useful advice about 'sprintf' and working with loops. However, your code gives the same number of means as the 'length(id)'. What I need at the end is just one mean, so I need to find some way of putting all the data into a single vector and then calculating a mean from that. — Jim Camp, Jun 14 '15 at 06:14

Julien Navarre · Answer 1 · 2015-06-14T09:55:54.327

Your loop only builds the file names; it never loops over the files themselves.

You get the mean of the last file because each pass through the loop overwrites name, so once the loop finishes only the last name created is left.

You should create a vector of names, then a vector of station paths, and loop over that.
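
To make the overwriting concrete, here is a minimal sketch with no file reading at all. The ids are just the 70:72 from the question, and names_all is a hypothetical variable name used for illustration. Note also that, as far as I know, recent R versions (4.2 and later) stop with an error on if(id < 10) when id has more than one element, so the original condition should test id[i] in any case.

```r
# Illustrative ids only; no files are read here.
id <- 70:72

# A single variable reassigned on every pass keeps only the last value.
for (i in seq_along(id)) {
  name <- sprintf("%03d", id[i])
}
name
# [1] "072"

# Writing into position i of a pre-allocated vector keeps every value.
names_all <- character(length(id))
for (i in seq_along(id)) {
  names_all[i] <- sprintf("%03d", id[i])
}
names_all
# [1] "070" "071" "072"
```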

Tip: you don't need a loop and conditional statements to create the names. You can use sprintf, specifying the width of the string you expect (3) and the character to pad it with (0):

> id <- c(1, 10, 100)
> names <- sprintf("%03d", id)
> names
[1] "001" "010" "100"

And this should work:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"

  names <- sprintf("%03d", id)
  files <- paste0(names, ".csv") # Or directly : files <- sprintf("%03d.csv", id)
  station <- file.path(filepath, directory, files)

  means <- numeric(length(station))

  for (i in seq_along(station)) {
    monitor <- read.csv(station[i])
    if(pollutant == "nitrate") {
      means[i] <- mean(monitor$nitrate, na.rm = TRUE)
    } else if(pollutant == "sulfate") {
      means[i] <- mean(monitor$sulfate, na.rm = TRUE)
    }
  }
  return(means)
}

EDIT: If you want a single mean, you can use the code above and weight each mean by the number of non-NA rows in its file. Replace the loop with:

means <- numeric(length(station))
counts <- numeric(length(station))

for (i in seq_along(station)) {
  monitor <- read.csv(station[i])
  if(pollutant == "nitrate") {
    means[i] <- mean(monitor$nitrate, na.rm = TRUE)
    counts[i] <- sum(!is.na(monitor$nitrate))
  } else if(pollutant == "sulfate") {
    means[i] <- mean(monitor$sulfate, na.rm = TRUE)
    counts[i] <- sum(!is.na(monitor$sulfate))
  }
}

myMean <- sum(means * counts) / sum(counts)
return(myMean)
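
As a quick check of the weighting, the sketch below uses two made-up vectors (a and b are hypothetical stand-ins for the pollutant column of two files) and compares the count-weighted mean with the mean of all the values pooled together, and with the plain average of the per-file means.

```r
# Made-up readings standing in for the pollutant column of two files.
a <- c(1, 2, NA, 3)
b <- c(10, NA, NA, 20, 30, 40)

means  <- c(mean(a, na.rm = TRUE), mean(b, na.rm = TRUE))
counts <- c(sum(!is.na(a)), sum(!is.na(b)))

weighted <- sum(means * counts) / sum(counts)  # per-file means weighted by non-NA counts
pooled   <- mean(c(a, b), na.rm = TRUE)        # mean of all values in one vector
simple   <- mean(means)                        # unweighted average of the two means

all.equal(weighted, pooled)
# [1] TRUE
simple
# [1] 13.5
```

The weighted and pooled results agree (106 / 7, about 15.14), while the unweighted average of the means does not. One caveat: a file in which every value is NA gives a NaN mean with a count of 0, and NaN * 0 is still NaN, so such files would need to be dropped before the final sum.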

Since your original intention was to gather your data into one vector, here is a solution that builds a list in which each element is the desired pollutant column of one data frame. unlist then combines all the vectors into one, and we can compute the mean of that vector.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  filepath <- "/Users/jim/Documents/Coursera/2_R_Prog/Data"

  names <- sprintf("%03d", id)
  files <- paste0(names, ".csv") # Or directly : files <- sprintf("%03d.csv", id)
  station <- file.path(filepath, directory, files)

  li <- lapply(station, function(x) {
    monitor <- read.csv(x)
    if(pollutant == "nitrate") {
      monitor$nitrate
    } else if(pollutant == "sulfate") {
      monitor$sulfate
    }
  })

  myMean <- mean(unlist(li))

  return(myMean)
}

score 0 · Answer 2 · answered Nov 04 '15 at 00:22

A small correction to Julien Navarre's second pollutantmean function: when calculating the mean, it does not ignore the NA values, so the result is NA whenever any of the files contains a missing value. The line calculating the mean should be:

myMean <- mean(unlist(li), na.rm = TRUE)
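
To see the effect of na.rm, here is a self-contained sketch that can be run without the course data. It is a variant of the function and not the original: the filepath argument, the "specdata" folder name and the two tiny CSV files are all assumptions made up for the example, and the if/else on the pollutant name is replaced by monitor[[pollutant]] indexing for brevity.

```r
# Hypothetical variant: the base folder is an argument so the example can
# run against temporary files instead of a hard-coded path.
pollutantmean <- function(directory, pollutant, id = 1:332, filepath = ".") {
  station <- file.path(filepath, directory, sprintf("%03d.csv", id))
  # One numeric vector per file, selected by column name.
  li <- lapply(station, function(x) read.csv(x)[[pollutant]])
  mean(unlist(li), na.rm = TRUE)
}

# Build two tiny made-up monitor files in a temporary folder.
base <- tempfile("pollutant_demo")
dir.create(file.path(base, "specdata"), recursive = TRUE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 0.5, 1.5)),
          file.path(base, "specdata", "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(NA, 2.5)),
          file.path(base, "specdata", "002.csv"), row.names = FALSE)

res_sulfate <- pollutantmean("specdata", "sulfate", id = 1:2, filepath = base)
res_nitrate <- pollutantmean("specdata", "nitrate", id = 1:2, filepath = base)
res_sulfate
# [1] 4
res_nitrate
# [1] 1.5

# Without na.rm = TRUE a single missing value turns the whole result into NA.
mean(c(1, NA, 3, 5, 7))
# [1] NA
```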
