
I have a large number of CSV files that I want to read into R. All the files have the same column headings, but I only want to import the rows from each file where a variable falls within a given range (above a minimum threshold and below a maximum threshold), e.g.

   v1   v2   v3
1  x    q    2
2  c    w    4
3  v    e    5
4  b    r    7

Filtering for v3 (v3 > 2 & v3 < 7) should result in:

   v1   v2   v3
1  c    w    4
2  v    e    5

So far I import all the data from all the CSV files into one data frame and then do the filtering:

# Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
# The row names of the merged data frame carry the source file names
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
# Combine the data with the file IDs
combFiles <- cbind(fileID, mergedFiles)
# Filter the data according to the criteria (min and max hold the thresholds)
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]

I would rather apply the filter while importing each CSV file into the data frame. I assume a for loop would be the best way of doing it, but I am not sure how. I would appreciate any suggestions.

Edit

After testing the suggestion from mnel, which worked, I ended up with a different solution:

fileNames <- list.files(path = workDir)
mzList <- list()
for (i in seq_along(fileNames)) {
  tempData <- read.csv(fileNames[i])
  # Keep only the rows whose first column lies within the thresholds
  mz.idx <- which(tempData[, 1] > minMZ & tempData[, 1] < maxMZ)
  mz1 <- tempData[mz.idx, ]
  # Record the source file name alongside the filtered rows
  mzList[[i]] <- data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile <- do.call("rbind", mzList)

Thanks for all the suggestions!

D. Sempere
  • Why do you want to filter while importing? Do you have speed or memory problems with your current solution? If it's speed, then the problem is probably the rbind. If it's memory, then it might still be the rbind, but then filtering first might help. – Jan van der Laan Apr 09 '13 at 09:18
  • @Jan van der Laan - Yes, the problem is memory... I am running out of it, which is why filtering out what is not important should be the way to go. – D. Sempere Apr 10 '13 at 20:59
  • I suggest you look into `sqldf` package, it should help you with that. – Dominic Comtois Mar 29 '15 at 06:12

3 Answers


Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a very fast implementation of do.call(rbind, list(...)) that is perfect for this situation. data.table also has a between function for exactly this kind of range filtering.

library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
  xx <- fread(x, sep = ',')
  # Tag each row with the name of the file it came from
  xx[, fileID := gsub(".csv.*", "", x)]
  # Keep only the rows with v3 strictly between min and max
  xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
}, min = 2, max = 3))

If the individual files are large and v3 only takes integer values, it might be worth setting v3 as a key and then using a binary search instead of a vector scan, as sketched below. It may also be quicker to import everything first and then run the filtering.
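
For illustration, here is a minimal sketch of the keyed approach, assuming v3 takes only integer values and that alldata has been built as above (the thresholds 2 and 7 come from the example in the question):

setkey(alldata, v3)                      # sorts by v3 and enables binary search
# for integer v3, "v3 > 2 & v3 < 7" is equivalent to v3 being one of 3:6;
# the join below finds those rows by binary search rather than a vector scan
result <- alldata[J(3:6), nomatch = 0L]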

mnel
  • Thanks for your suggestion, I tried it and it was indeed faster. In the end I went with a different solution though, see above. – D. Sempere Apr 10 '13 at 21:02

If you want to do the filtering before importing the data, try read.csv.sql from the sqldf package, which runs an SQL query on each file as it is read.
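
For example, a minimal sketch (the thresholds 2 and 7 are taken from the example in the question; in read.csv.sql the file being read is referred to as the table "file" in the query, and the filtering is done by SQLite before the rows ever reach R):

library(sqldf)
fileNames <- list.files(path = workDir)
# each file is filtered while it is read, so only matching rows are imported
resultFile <- do.call("rbind", lapply(fileNames, function(fn) {
  read.csv.sql(fn, sql = "select * from file where v3 > 2 and v3 < 7")
}))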

Nishanth

If you are really stuck for memory, then the following solution might work. It uses LaF to read only the column needed for filtering, calculates the total number of lines that will be read, initializes the complete data.frame, and then reads the required lines from the files. (It is probably not faster than the other solutions.)

library("LaF")

colnames <- c("v1","v2","v3")
colclasses <- c("character", "character", "numeric")

fileNames <- list.files(pattern = "\\.csv$")

# First determine which lines to read from each file and the total number of lines
# to be read
lines <- list()
for (fn in fileNames) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  d   <- laf$v3[] 
  lines[[fn]] <- which(d > 2 & d < 7)
}
nlines <- sum(sapply(lines, length))

# Initialize data.frame
df <- as.data.frame(lapply(colclasses, do.call, list(nlines)), 
        stringsAsFactors=FALSE)
names(df) <- colnames

# Read the lines from the files
i <- 0
for (fn in names(lines)) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  n   <- length(lines[[fn]])
  df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
  i   <- i + n
}
Jan van der Laan