Reading multiple files and calculating mean based on user input

Question

I am trying to write a function in R which takes 3 inputs:

Directory
pollutant
id

I have a directory on my computer containing over 300 CSV files. What the function should do is shown in the prototype below:

pollutantmean <- function(directory, pollutant, id = 1:332) {
        ## 'directory' is a character vector of length 1 indicating
        ## the location of the CSV files

        ## 'pollutant' is a character vector of length 1 indicating
        ## the name of the pollutant for which we will calculate the
        ## mean; either "sulfate" or "nitrate".

        ## 'id' is an integer vector indicating the monitor ID numbers
        ## to be used

        ## Return the mean of the pollutant across all monitors list
        ## in the 'id' vector (ignoring NA values)
        }

An example output of this function is shown here:

source("pollutantmean.R")
pollutantmean("specdata", "sulfate", 1:10)

## [1] 4.064

pollutantmean("specdata", "nitrate", 70:72)

## [1] 1.706

pollutantmean("specdata", "nitrate", 23)

## [1] 1.281

I can read the whole thing in one go by:

path = "C:/Users/Sean/Documents/R Projects/Data/specdata"
fileList = list.files(path=path,pattern="\\.csv$",full.names=T)
all.files.data = lapply(fileList,read.csv,header=TRUE)
DATA = do.call("rbind",all.files.data)

My issues are:

The user enters id either as a single value or as a range. For example, the user enters 1 but the file name is 001.csv, or the user enters the range 1:10 and the file names are 001.csv ... 010.csv.
The column of interest, "sulfate" or "nitrate", is entered by the user. These columns contain a lot of missing values, which I need to omit before calculating the mean.

The combined data from all the files looks like this:

summary(DATA)
         Date           sulfate          nitrate             ID       
 2004-01-01:   250   Min.   : 0.0     Min.   : 0.0     Min.   :  1.0  
 2004-01-02:   250   1st Qu.: 1.3     1st Qu.: 0.4     1st Qu.: 79.0  
 2004-01-03:   250   Median : 2.4     Median : 0.8     Median :168.0  
 2004-01-04:   250   Mean   : 3.2     Mean   : 1.7     Mean   :164.5  
 2004-01-05:   250   3rd Qu.: 4.0     3rd Qu.: 2.0     3rd Qu.:247.0  
 2004-01-06:   250   Max.   :35.9     Max.   :53.9     Max.   :332.0  
 (Other)   :770587   NA's   :653304   NA's   :657738

Any ideas on how to approach this would be highly appreciated.

Cheers

I'm struggling through this assignment right now. VERY poor instructional design from the JHU guys. The first assignment should be one that works with one CSV file...let the student build skills gradually. Combining multiple CSV files should be assignment 2, or 3. Had to vent. — Paulb, Oct 16 '15 at 12:31

score 4 · Answer 1 · answered May 14 '14 at 05:14

So, you can simulate your situation like this:

# Simulate some data:
# Create 332 data frames
set.seed(1)
df.list<-replicate(332,data.frame(sulfate=rnorm(100),nitrate=rnorm(100)),simplify=FALSE)
# Generate names like 001.csv and 010.csv
file.names<-paste0('specdata/',sprintf('%03d',1:332),'.csv')
# Write them to disk (create the directory first if it does not exist)
dir.create('specdata', showWarnings = FALSE)
invisible(mapply(write.csv,df.list,file.names))

And here is a function that would read those files:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  file.names <- list.files(directory)
  file.numbers <- as.numeric(sub('\\.csv$','', file.names))
  selected.files <- na.omit(file.names[match(id, file.numbers)])
  selected.dfs <- lapply(file.path(directory,selected.files), read.csv)
  mean(c(sapply(selected.dfs, function(x) x[ ,pollutant])), na.rm=TRUE)
}

pollutantmean('specdata','nitrate',c(1:100,141))
# [1] -0.005450574

nograpes


This works well for single args but gives an error if I do `pollutantmean(path,"nitrate",70:72)`: `[1] NA Warning message: In mean.default(c(sapply(selected.dfs, function(x) x[, pollutant])), : argument is not numeric or logical: returning NA` – Shery May 14 '14 at 09:13
That code works in the example. Are you sure there isn't something special about `70.csv`-`72.csv`? – nograpes May 14 '14 at 10:31
@nograpes: The solution needs a slight modification. We need to unlist the output of sapply, so it would look like this: `selected.dfs <- lapply(file.path(directory,selected.files), read.csv); e <- sapply(selected.dfs, function(x) x[ ,pollutant]); n <- unlist(e); mean(n, na.rm = TRUE)` – Shagun Sodhani Jun 07 '14 at 07:05
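
A likely explanation for the warning in the comments above: when the files have different numbers of rows, sapply cannot simplify its result to a matrix and returns a list instead, and mean does not accept a list. A minimal sketch with two made-up data frames standing in for the files (the values are assumptions for illustration only):

```r
# Two hypothetical monitors with different numbers of rows
dfs <- list(data.frame(nitrate = c(1, NA, 3)),
            data.frame(nitrate = c(5, 7)))

# Column lengths differ (3 and 2), so sapply returns a list, not a matrix
cols <- sapply(dfs, function(x) x[, "nitrate"])
is.list(cols)                              # TRUE

# unlist flattens the list into one numeric vector before averaging
pooled_mean <- mean(unlist(cols), na.rm = TRUE)
pooled_mean                                # (1 + 3 + 5 + 7) / 4 = 4
```

In the simulated data every file has 100 rows, so sapply returns a matrix and the original code works there.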

score 3 · Answer 2 · answered Aug 19 '15 at 22:14

Here is a solution that even your grandmother could understand:

pollutantmean <- function(directory, pollutant, id = 1:332) {

  # Break this function up into a series of smaller functions
  # that do exactly what you expect them to. Your friends
  # will love you for it.

  csvFiles = getFilesById(id, directory)

  dataFrames = readMultipleCsvFiles(csvFiles)

  dataFrame = bindMultipleDataFrames(dataFrames)

  getColumnMean(dataFrame, column = pollutant)
}


getFilesById <- function(id, directory = getwd()) {
  allFiles = list.files(directory)
  file.path(directory, allFiles[id])
}

readMultipleCsvFiles <- function(csvFiles) {
  lapply(csvFiles, read.csv)
}

bindMultipleDataFrames <- function(dataFrames) {
  Reduce(function(x, y) rbind(x, y), dataFrames)
}

getColumnMean <- function(dataFrame, column, ignoreNA = TRUE) {
  mean(dataFrame[ , column], na.rm = ignoreNA)
}
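
One caveat, as far as I can tell: getFilesById picks files by position in the listing, so it matches monitor IDs only when the directory holds exactly the files 001.csv onward with no gaps. A small sketch of the difference, using a made-up listing with gaps and a lookup by the number in the file name as the alternative:

```r
# Hypothetical directory listing with missing monitors
allFiles <- c("001.csv", "002.csv", "057.csv")
id <- 57

# Positional indexing: there is no 57th element, so this is NA
by_position <- allFiles[id]

# Lookup by the numeric part of the file name
file_ids <- as.integer(sub("\\.csv$", "", allFiles))
by_name <- allFiles[match(id, file_ids)]

by_position   # NA
by_name       # "057.csv"
```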

samhiggins2001 · Answer 3 · 2014-05-14T07:09:05.753

The user enters id either as a single value or as a range. For example, the user enters 1 but the file name is 001.csv, or the user enters the range 1:10 and the file names are 001.csv ... 010.csv.

You could use a regular expression and the gsub function to remove leading zeros from the file names, then make a dictionary (in R, a named vector) that maps the modified file names to the actual file names. For example, if your file names are in a character vector fnames:

fnames = c("001.csv","002.csv")
names(fnames) <- gsub(pattern="^[0]*", replacement="", x=fnames)

With this, the vector fnames is converted to a dictionary, letting you call up the file named 001.csv with something along the lines of fnames["1.csv"]. You can also use gsub() to remove the .csv part of the file name.
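
Putting both steps together, a minimal sketch (the three file names and the id vector are made up for illustration) that strips the extension and the leading zeros, then looks files up by ID:

```r
fnames <- c("001.csv", "002.csv", "010.csv")   # example file names

# Keys: drop the extension, then drop leading zeros ("010.csv" -> "10")
names(fnames) <- sub("^0+", "", sub("\\.csv$", "", fnames))

# Names are character strings, so convert the numeric ids before indexing
id <- c(1, 10)
selected <- fnames[as.character(id)]
unname(selected)   # "001.csv" "010.csv"
```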

The column of interest, "sulfate" or "nitrate", is entered by the user. These columns contain a lot of missing values, which I need to omit before calculating the mean.

Many R functions have an option for ignoring NA, the special value indicating a missing value. Try entering help(mean) at the R command prompt to find information on this functionality.
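
For mean, that option is the na.rm argument. A quick illustration on a made-up vector:

```r
x <- c(1.5, NA, 2.5, NA, 5)   # example values with two missing entries

with_na <- mean(x)                    # NA, because any NA propagates
without_na <- mean(x, na.rm = TRUE)   # (1.5 + 2.5 + 5) / 3 = 3

with_na
without_na
```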

score 2 · Accepted Answer · answered May 14 '14 at 11:28

That's the way I fixed it:

pollutantmean <- function(directory, pollutant, id = 1:332) {
    #set the path
    path = directory

    #get the file List in that directory
    fileList = list.files(path)

    #extract the file names and store as numeric for comparison
    file.names = as.numeric(sub("\\.csv$","",fileList))

    #select files to be imported based on the user input or default
    selected.files = fileList[match(id,file.names)]

    #import data
    Data = lapply(file.path(path,selected.files),read.csv)

    #convert into data frame
    Data = do.call(rbind.data.frame,Data)

    #calculate mean
    mean(Data[,pollutant],na.rm=TRUE)

    }

My last question: my function should take "specdata" (the name of the directory where all the CSV files are located) as the directory. Is there a directory type object in R?

Suppose I call the function as:

pollutantmean(specdata, "nitrate", 1:10)

It should get the path of the specdata directory, which is in my working directory. How can I do that?

you can use setwd to change the path to point to your working directory containing the specdata folder first. — melaos, Sep 29 '16 at 15:11
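
As far as I know, base R has no dedicated directory object: a directory is passed as a character string holding a path, and a relative path is resolved against the working directory. A minimal sketch, using a throwaway temporary folder and a made-up 001.csv so that it runs anywhere (work_dir and the file contents are assumptions for illustration):

```r
# Set up a throwaway working directory that contains a 'specdata' folder
work_dir <- tempfile("wd")
dir.create(file.path(work_dir, "specdata"), recursive = TRUE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 4)),
          file.path(work_dir, "specdata", "001.csv"), row.names = FALSE)
old_wd <- setwd(work_dir)

directory <- "specdata"          # a plain, quoted string
found <- dir.exists(directory)   # TRUE when it sits in the working directory
csv_paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
print(csv_paths)                 # "specdata/001.csv"

setwd(old_wd)                    # restore the previous working directory
```

So the call would be pollutantmean("specdata", "nitrate", 1:10), with the directory name in quotes.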

Rich Scriven · Answer 5 · 2014-05-16T13:03:07.723

Here's a somewhat general function for calculating the mean for a specific column over a list of files. Not sure how id should be set up, but right now it acts as an indexing vector (i.e. id = 1:3 calculates the mean for the first three files in the file list).

multifile.means <- function(directory = getwd(), pollutant, id = NULL)
{
    d <- match.arg(directory, list.files())
    cn <- match.arg(pollutant,  c('sulfate', 'nitrate'))
    ## get a vector of complete file paths in the given 'directory'
    p <- dir(d, full.names = TRUE)
    ## subset 'p' based on 'id' values
    if(!is.null(id)){
        id <- id[!id > length(p)]
        p <- p[id]
    }
    ## read, store, and name the relevant columns
    cl <- sapply(p, function(x){ read.csv(x)[,cn] }, USE.NAMES = FALSE)
    colnames(cl) <- basename(p)
    ## return a named list of some results
    list(values = cl, 
         mean = mean(cl, na.rm = TRUE), 
         colMeans = colMeans(cl, na.rm = TRUE))
}

Take it for a test-drive:

> multifile.means('testDir', 'sulfate')
# $values
#      001.csv 057.csv 146.csv 213.csv
# [1,]       5      10      NA       9
# [2,]       1       1      10       3
# [3,]      10       4      10       2
# [4,]       3      10       9      NA
# [5,]       4       1       5       5

# $mean
# [1] 5.666667

# $colMeans
# 001.csv 057.csv 146.csv 213.csv 
#    4.60    5.20    8.50    4.75
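
The function above relies on match.arg to validate the column name. A small sketch of how that behaves on its own (check_pollutant is a hypothetical helper, not part of the answer):

```r
# Hypothetical helper wrapping match.arg for the two allowed column names
check_pollutant <- function(pollutant) {
  match.arg(pollutant, c("sulfate", "nitrate"))
}

exact <- check_pollutant("nitrate")   # "nitrate"
partial <- check_pollutant("sulf")    # "sulfate": partial matching is allowed

# Anything else raises an error, caught here to keep the sketch running
bad <- tryCatch(check_pollutant("niterate"), error = function(e) "rejected")
bad                                   # "rejected"
```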

score 1 · Answer 6 · answered Aug 17 '16 at 14:38

The selected answer looks good but here's an alternative. This answer works well for the basics covered by the JHU course.

pollutantmean <- function(directory, pollutant, id = 1:332) {
    csvfiles <- dir(directory, "\\.csv$", full.names = TRUE)
    data <- lapply(csvfiles[id], read.csv)
    numDataPoints <- 0L
    total <- 0L
    for (filedata in data) {
        d <- filedata[[pollutant]] # relevant column data
        d <- d[complete.cases(d)] # remove NA values
        numDataPoints <- numDataPoints + length(d)
        total <- total + sum(d)
    }
    total / numDataPoints
}
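
Keeping a running total and count, as the loop above does, gives the mean over all pooled readings, which is generally not the same as averaging the per-file means when files have different numbers of valid values. A quick sketch with two made-up monitors:

```r
a <- c(1, 2, 3, NA)      # hypothetical monitor with 3 valid readings
b <- c(10, NA, NA, NA)   # hypothetical monitor with 1 valid reading

# Pooled mean: total of valid values divided by number of valid values
pooled <- (sum(a, na.rm = TRUE) + sum(b, na.rm = TRUE)) /
  (sum(!is.na(a)) + sum(!is.na(b)))                     # 16 / 4 = 4

# Mean of per-monitor means weights each monitor equally instead
mean_of_means <- mean(c(mean(a, na.rm = TRUE),
                        mean(b, na.rm = TRUE)))         # (2 + 10) / 2 = 6

pooled
mean_of_means
```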

csvfiles <- dir (directory, pattern= "*.csv", full.names = TRUE) — Farah Nazifa, Oct 21 '18 at 05:17

score 1 · Answer 7 · answered Oct 12 '17 at 20:34

It took me a couple of hours to work this out, but here is my (shorter) version

pollutmean<- function(dir, pollutant, id=1:332) {
  dir<- list.files(dir, full.names = T)     #list files
  dat<- data.frame()                        #make empty df
  for (i in id) {
    dat <- rbind(dat, read.csv(dir[i]))     #rbind all files
  }
  mean(dat[,pollutant], na.rm = TRUE)       #calculate mean of given column
}

pollutmean("assign/specdata", "sulfate", id=1:60)
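
A side note on the loop above: growing a data frame with rbind one file at a time should give the same rows as binding a whole list in one call, and the one-call form usually copies less. A small sketch with in-memory data frames standing in for the read.csv results (the values are made up):

```r
# Stand-ins for the data frames that read.csv would return
frames <- list(data.frame(sulfate = c(1, NA)),
               data.frame(sulfate = c(3, 5)))

# Loop version, as in the answer
looped <- data.frame()
for (f in frames) {
  looped <- rbind(looped, f)
}

# One-call version
bound <- do.call(rbind, frames)

mean(looped$sulfate, na.rm = TRUE)   # (1 + 3 + 5) / 3 = 3
mean(bound$sulfate, na.rm = TRUE)    # same value
```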

score -1 · Answer 8 · answered Oct 14 '16 at 17:31

I was reading the course as well, and came up with the following solution:

pollutantmean <- function(directory = "d:/dev/r/documents/specdata",
                          pollutant, id) {
  # Build zero-padded file names such as 001.csv and 010.csv
  myfilename = paste(directory, "/", formatC(id, width = 3, flag = "0"), ".csv",
                     sep = "")
  master = lapply(myfilename, read.table, header = TRUE, sep = ",")
  masterfile = do.call("rbind", master)

  # Column 2 holds sulfate and column 3 holds nitrate
  if (pollutant == "sulfate") {
    result = mean(masterfile[[2]], na.rm = TRUE)
  } else if (pollutant == "nitrate") {
    result = mean(masterfile[[3]], na.rm = TRUE)
  } else {
    stop("pollutant must be \"sulfate\" or \"nitrate\"")
  }
  result
}
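
A variation on the same idea, offered only as a sketch: build the paths with sprintf and pick the column by name instead of by position. The function name pollutant_mean_sketch, the temporary directory, and the two tiny CSV files are all made up for illustration, and the sketch assumes files named with three-digit zero-padded IDs.

```r
# Hypothetical variant: zero-padded paths plus column selection by name
pollutant_mean_sketch <- function(directory, pollutant, id) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  values <- unlist(lapply(paths, function(p) read.csv(p)[[pollutant]]))
  mean(values, na.rm = TRUE)
}

# Tiny made-up data set in a temporary directory
demo_dir <- tempfile("specdata")
dir.create(demo_dir)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 4, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(NA, 6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

sulfate_mean <- pollutant_mean_sketch(demo_dir, "sulfate", 1:2)   # 16 / 4 = 4
nitrate_mean <- pollutant_mean_sketch(demo_dir, "nitrate", 2)     # 6
```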
