
I'm currently writing a program (full disclosure, it's "homework"). The program is designed to run through a series of files based on a given range, collate them into one large table (dropping NAs), and find the mean of the given pollutant (which is a column in the table).

I wrote the program previously, but wanted to play around with compartmentalising the functions a bit more, so I rewrote it.

Strangely, some ranges return the exact same number as the original program, while others return (relatively) radically different results.

For instance:

pollutantmean("specdata", "sulfate", 1:10)

Old Program: 4.064128

New Program: 4.064128


pollutantmean("specdata", "nitrate", 23)

Old Program: 1.280833

New Program: 1.280833


pollutantmean("specdata", "nitrate", 70:72)

Old Program: 1.706047

New Program: 1.732979


In that final example, the old program is producing the expected result, while the new program is producing a result not within the acceptable margin of error at all.


I'm simply at a loss. I've been trying to rewrite my new code to minimise differences with the old code without simply reproducing the old program; the current code is below (along with the original program). But nothing is working: I continue to receive the exact same (bad) result despite making quite a few changes.


New Program:

concatTables <- function(directory, id, hasHeader = TRUE, keepNAs = FALSE) {
      totalTable <- NULL
      currentTable <- NULL
      for (file in id) {
            filename <- paste( sep ="",
                               directory,"/",formatC(file,width=3,format="d",flag="0"),".csv"
            );
            currentTable <- read.csv(file = filename, header = hasHeader);
            
            if (!is.null(totalTable)) {
                  totalTable <- rbind(totalTable, currentTable);
            }
            else {
                  totalTable <- currentTable;
            }
      }
      if (!keepNAs) {
            totalTable <- completeRows(totalTable);
      }
      totalTable
}

completeRows <- function(table) {
      table <- table[complete.cases(table),]
      table
}

pollutantmean <- function(directory = paste(getwd(),"/specdata",sep = ""), pollutant, id = 1:332, hasHeader = TRUE, keepNAs = FALSE) {
      table <- NULL
      table <- concatTables(directory,id,hasHeader,keepNAs);
      tableMean <- mean(table[[pollutant]]);
      tableMean
}

Old Program (which produces the expected results):

dataFileName <- NULL

pollutantmean <- function(directory = "specdata", pollutant, id = 1:332, idWidth = 3, fullLoop = TRUE) {
    dataFrame <- NULL
    dataFrameTotal <- NULL
    for (i in id) {
        dataFileName <- paste(directory, "/", formatC(i, width = idWidth, flag = 0), ".csv", sep = "")
        if (!is.null(dataFileName)) {
            dataFileConnection <- file(dataFileName)
            dataFrame <- read.csv(dataFileConnection, header = TRUE)
            dataFrameTotal <- rbind(dataFrame, dataFrameTotal)
            
            
            ##close(dataFileConnection)
            if (fullLoop == FALSE) {
                break
            }
        }
        else print("DATAFILENAME IS NULL!")
    }
    print(mean(dataFrameTotal[[pollutant]], na.rm = TRUE))
}
NinKenDo
  • Probably the difference is in `complete.cases()`. Try `keepNAs = TRUE` and add `na.rm = TRUE` to the `mean()` call – Andriy T. Jul 22 '15 at 07:08

1 Answer


The difference is that complete.cases() returns FALSE for any row where at least one of the columns is NA (so subsetting with it drops the whole row), while the na.rm argument to mean() only removes NAs from the selected column (vector).

Example:

x <- airquality[1:10, -1]
x[3,3] <- NA

> mean(x[complete.cases(x), "Temp"]) == mean(x[["Temp"]], na.rm = T)
[1] FALSE

Note that complete.cases() returns FALSE on rows 5 and 6, where the Solar.R column is NA, so you lose two observations whose Temp values are not NA.
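To make the fix concrete, here is a minimal sketch, again using the built-in airquality data in place of the specdata files. (pollutantmean2 is a hypothetical name for illustration, not the asker's code.) It removes NAs from the selected column only, instead of dropping every row that complete.cases() flags:

```r
## Hypothetical helper: average a column after removing only that
## column's NAs, rather than dropping whole rows via complete.cases().
pollutantmean2 <- function(table, pollutant) {
      column <- table[[pollutant]]
      mean(column[!is.na(column)])   # same as mean(column, na.rm = TRUE)
}

x <- airquality[1:10, -1]   # drop Ozone; rows 5-6 still have NA in Solar.R
x[3, 3] <- NA               # introduce one NA in Temp

pollutantmean2(x, "Temp")            # 64.11111 (9 Temp values kept)
mean(x[complete.cases(x), "Temp"])   # 65 (only 7 rows survive)
```

The two calls disagree precisely because complete.cases() also discards rows 5 and 6, whose Temp values are present but whose Solar.R is NA.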

Andriy T.
  • Ah, I see: so complete.cases() meant I was removing every row in totalTable that had an NA in any column, even when some of those rows had data for the particular column I wanted? That makes perfect sense. It should have been obvious, but I guess sometimes it just takes a second set of eyes. Thanks for your time and help. – NinKenDo Jul 22 '15 at 07:25
  • Exactly. I think it is a common problem for all programmers :) In such cases I usually take a rest and then repeat the review as if I'm seeing the code for the first time. Mostly it works :) Glad to help you – Andriy T. Jul 22 '15 at 07:42