5

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read at one go, given my ram size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far unsuccessfully.

My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.

I calculate skipVec as:

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers-1]))-1

But for the moment I am just handing my function a skipVec vector of all zeros.

If my logic is correct, this should return all the rows. It does not. Instead I get an error:

"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.

Here is the code that produces the error message above:

# Make a small small test data frame, write it to a file, and read it back in 
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I yank the readLines command out of the lapply and the function and just stick it in by itself, I still get the same error.

epo3
  • 2,991
  • 2
  • 33
  • 60
andrewH
  • 2,281
  • 2
  • 22
  • 32

1 Answers1

5

If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE it is not one line that is read at each iteration but two, so you eventually run out of lines faster than you think, here on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input

Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:

write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
  })
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)

Some clues towards higher speeds:

  1. use scan instead of read.table. Read data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
  2. Instead of looping over skipVec, loop over its rle if it is much shorter. So you'll be able to read or skip chunks of lines at a time.
flodel
  • 87,577
  • 21
  • 185
  • 223
  • Dear @flodel- This code doesn't work, but I think this is just a typo, because it works for skipVec=0,0,0 instead of 0,1,0. Though it does work, I am embarrassed to admit that I am missing some of the logic. First, I don't know why you scan the 1st line and use read.table for the rest. But my bigger confusion, perhaps because I don't understanding how files & connections are processed differently, is that I don't see why skipVec=0,0,0 doesn't start at the 1st line. Also, I don't know what an rle is. I've read the function doc, but I don't know what a run is or a run length, or how to find out. – andrewH Oct 08 '13 at 03:10
  • *This code doesn't work* is not very helpful to me unless you tell me why it does not work. Unexpected output, error message? Note that it does work with your example so maybe your real data file has a different format than the one you are making us build in this example. – flodel Oct 08 '13 at 10:38
  • A connection is like a file that is kept open, so it can be accessed multiple times for reading or writing. The connection includes a pointer to where the data was last read or written. When you first open the file, the file handle is pointing to the beginning of the file. After you use `read.table(con, skip=0, nrow=1, header=TRUE, sep="")` the first time, the header and the first data row of your file are read so the pointer is now pointing to the beginning of the second data row. – flodel Oct 08 '13 at 10:52
  • Dear @flodel - By "doesn't work" I only meant that, as I stated, I was trying to get it to return all the rows as bad. Your code returns the first and third rows of my toy example. I spent a lot of time trying to figure out why you had changed my rows parameter, until I realized that your code returned the right answer for the skipRows you gave me - the parameter was wrong, but the algorithm right. That is why I said I thought it was just a typo. If you agree, say so or edit the 1 to a 0 & I'll uptick the answer. I would still be grateful to know why you used scan for for the 1st line. – andrewH Oct 08 '13 at 15:48
  • I see now. It was not a typo: I did change the value of `skipVec` on purpose. I thought having a mix of good and bad rows was a more representative example of your general problem: it also showed that my algorithm was properly keeping or rejecting rows. Otherwise if `skipVec` only contains zeroes, one could argue that you can just run `read.table("testThis.DF", header = TRUE)` and done. Sorry I did not make that clear. – flodel Oct 08 '13 at 21:11
  • For the header, I preferred to use `scan` because it returns a vector of characters, while `read.table()` returns a data.frame. Not a huge difference here. – flodel Oct 08 '13 at 21:20