0

I have more than 300 csv files in a directory. The csv files have a following structure

id              Date        Nitrate     Sulfate
id of csv file  Some date   Some Value  Some Value
id of csv file  Some date   Some Value  Some Value
id of csv file  Some date   Some Value  Some Value

I want to count number of row in each csv file excluding the NA in that file and stored it in dataframe which has two columns: (1) id & (2) nobs.

Here is my code for that:

complete <-function(directory,id){
  filenames <-sprintf("%03d.csv", id)
  filenames <-paste(directory,filenames,sep = '/')
  dataframe <-data.frame(id=numeric(0),nobs=numeric(0))
  for(i in filenames){
    data <- read.csv(i)
    dataframe[i,dataframe$id]<-data[data$id]
    dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])
  }

  dataframe

}

The problem arises when I try to populate dataframe inside the loop, it seems like it is not populating the data frame and returning me NULL. I know that I am doing something stupid.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Farrukh Ahmed
  • 473
  • 4
  • 11
  • 26
  • did you check this post ? http://stackoverflow.com/questions/14358629/counting-the-number-of-rows-of-a-series-of-csv-files –  Feb 28 '16 at 10:01

2 Answers2

3

I usually prefer to add the rows into a pre-allocated list then bind them together. Here's a working example :

##### fake read.csv function returning random data.frame 
# (just to reproduce your case, remove this from your code...)
read.csv <- function(fileName){
  stupidHash <- sum(as.integer(charToRaw(fileName)))
  if(stupidHash %% 2 == 0){
    return(data.frame(id=stupidHash,date='2016-02-28',
                      nitrate=c(NA,2,3,NA,5),sulfate=c(10,20,NA,NA,40)))
  }else{
    return(data.frame(id=stupidHash,date='2016-02-28',
                      nitrate=c(4,2,3,NA,5,9),sulfate=c(10,20,NA,NA,40,50)))
  }
}
#####

complete <-function(directory,id){
  filenames <-sprintf("%03d.csv", id)
  filenames <-paste(directory,filenames,sep = '/')
  # here we pre-allocate a list of lenght=length(filenames)
  # where we will put the rows of our future data.frame
  rowsList <- vector(mode='list',length=length(filenames)) 
  for(i in 1:length(filenames)){
    filename <- filenames[i]
    data <- read.csv(filename)
    rowsList[[i]] <- data.frame(id=data$id[1],
                                nobs=sum(!is.na(data$sulfate) & !is.na(data$nitrate)))
  }
  # here we bind all the previously created rows together into one data.frame
  DF <- do.call(rbind.data.frame, rowsList)
  return(DF)
}

Usage example :

res <- complete(directory='dir',id=1:3)

> res
   id nobs
1 889    4
2 890    2
3 891    4
digEmAll
  • 56,430
  • 9
  • 115
  • 140
2

The problem is in these 2 lines:

dataframe[i,dataframe$id]<-data[data$id]
dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])

If you want to extend a dataframe, please use rbind function. But please be aware of that it is not effective way, because it allocate new memory and copy all data and add one new row. The effective way is to allocate dataframe big enough in this line:

dataframe <-data.frame(id=numeric(0),nobs=numeric(0))

Instead of 0, use number of expected number of rows.

So the easiest way is to

dataframe <- rbind(dataframe, data.frame(id=data$id[1], nobs=nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),]))

More effective way is something like that:

dataframe <-data.frame(id=numeric(numberOfRows),nobs=numeric(numberOfRows))

and after that in loop:

dataframe[i,]$id<-data$id[1]
dataframe[i,]$nobs<-nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])

UPDATE: I changed values you used to populate dataframe to data$id[1] and nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])

bartoszukm
  • 693
  • 3
  • 10
  • @bartoszukum I tried your approach but it ends with this error. Error in `$<-.data.frame`(`*tmp*`, "id", value = list()) : replacement has 1461 rows, data has 1 – Farrukh Ahmed Feb 28 '16 at 10:18
  • @Farrukh Ahmed Sorry, I assumed values you use to populate dataframe are valid. Please consider it now. I assume that data$id has the same values in every position, so I use first. To use is.na, you have to check it for every column separately and make logical 'and' after that. – bartoszukm Feb 28 '16 at 10:36