I'm pretty new to R, but it seems that this is a specific problem to which I have not been able to find an answer.

My program reads in some data, then rbinds certain columns of that data to one of several data frames based on a vector of column numbers I pass it, so something like this:

filename <- c("vector", "full", "of", "filenames")
colVal <- (32)    
InMat <- data.frame()
for (i in 1:length(filename)){
  file <- read.table(filename[i], header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
  InMat <- rbind(InMat, file[c(2:dim(file)[1]), colVal])
  #...other matrices...
}

My issue lies in the case where there is only one desired column, i.e. colVal takes one value. In this case, I find that InMat is essentially transposed from what I would require. Worse, when I read in multiple files, it rbinds the transposed desired column, so I get one row per file I'm reading, with as many columns as there are rows in the desired column of each file.

It seems that if there are 2 desired columns (i.e. colVal takes two or more values), then it acts as I expect (i.e. a column is read and stored in InMat as a column, columns from each additional file are stored below).

My question is: why does rbind act differently when only one desired column value is passed to it, and is there an easy way (read: not adding some clunky if or for loop to check) to avoid this?

Thanks!

2 Answers

Short answer: [.data.frame (the [ operator on data frames) by default converts its output to the lowest possible dimension (via the argument drop=TRUE). If you pull just one column, the result is a vector, and rbind treats each vector as a row, which is why the result looks transposed. When you extract two or more columns, you get a data frame, so the output of rbind is a data frame with the rows stacked as you expect.

The quick fix is to change this line:

InMat <- rbind(InMat, file[c(2:dim(file)[1]), colVal]) #old line
InMat <- rbind(InMat, file[c(2:dim(file)[1]), colVal, drop=FALSE]) #new line
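
To see the dropping behaviour in isolation, here is a minimal sketch on a made-up data frame (the name `df` and its contents are assumptions for illustration, not your data):

```r
# Made-up data frame for illustration; not the asker's data.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

one_dropped <- df[2:nrow(df), 2]                # single column: plain vector
one_kept    <- df[2:nrow(df), 2, drop = FALSE]  # single column: still a data frame
two_cols    <- df[2:nrow(df), c(2, 3)]          # two columns: data frame either way

is.data.frame(one_dropped)  # FALSE
is.data.frame(one_kept)     # TRUE

# rbind() on bare vectors treats each vector as a row,
# which produces the "transposed" shape
stacked_vectors <- rbind(one_dropped, one_dropped)
dim(stacked_vectors)        # 2 rows, 2 columns

# With drop = FALSE the pieces stack vertically
stacked_frames <- rbind(one_kept, one_kept)
dim(stacked_frames)         # 4 rows, 1 column
```

Because drop=FALSE behaves the same for one column or several, no if check on the length of colVal should be needed.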

A more R-like way of coding this would be to use lapply and call rbind once. Because R copies an object when it is modified, growing an object by repeatedly concatenating or binding onto it is quite inefficient (see the second circle of the R Inferno).

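filename <- c("vector", "full", "of", "filenames")
colVal <- 32
# read every file into a list of data frames
dfm <- lapply(filename, read.table,
              header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
# single-argument `[` selects columns and always returns a data frame
# (note: unlike the loop above, this keeps the first row of each file)
dfm <- lapply(dfm, `[`, colVal)
# bind all pieces together in one call
dfm <- do.call(rbind, dfm)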
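
If you want to try the pattern without your real files, the following is a rough, self-contained sketch. The file names, column names and values are all invented, and it skips the first data row of each file the way the original loop did:

```r
# Write two small made-up files to a temporary directory (stand-ins for the real data)
tmp_dir <- tempfile("demo_")
dir.create(tmp_dir)
filename <- file.path(tmp_dir, c("jan.txt", "feb.txt"))
write.table(data.frame(temp = c(1.5, 2.5, 3.5), wind = c(10, 20, 30)),
            filename[1], row.names = FALSE)
write.table(data.frame(temp = c(4.5, 5.5), wind = c(40, 50)),
            filename[2], row.names = FALSE)

colVal <- 2  # one column here; c(1, 2) would work the same way

pieces <- lapply(filename, function(f) {
  d <- read.table(f, header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  d[-1, colVal, drop = FALSE]  # skip the first data row, keep the data-frame shape
})
InMat <- do.call(rbind, pieces)  # a single rbind over all pieces
```

InMat here is a one-column data frame with the rows from both files stacked on top of each other.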

If you know the positions of the columns you want to extract beforehand, you could use the colClasses argument of read.table and skip over reading the entire table:

filename <- c("vector", "full", "of", "filenames")
colVal <- 32
cc <- rep.int("NULL", 40) # "NULL" skips a column; 40 is the number of columns in the table
cc[colVal] <- NA          # NA lets read.table guess the type of the wanted column(s)
dfm <- lapply(filename, read.table,
              header=TRUE, fill=TRUE, colClasses=cc, stringsAsFactors=FALSE)
dfm <- do.call(rbind, dfm)
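
As a small check of how colClasses behaves, here is a sketch on an invented three-column file (the column names, values and the choice of "numeric" are assumptions for the example). Naming the class instead of using NA saves read.table from guessing the type:

```r
# Invented three-column file for illustration
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:3, b = c("x", "y", "z"), c = c(0.1, 0.2, 0.3)),
            tmp, row.names = FALSE)

n_cols <- 3
colVal <- 3
cc <- rep("NULL", n_cols)  # "NULL" tells read.table to skip that column
cc[colVal] <- "numeric"    # explicit class for the wanted column; NA would guess

d <- read.table(tmp, header = TRUE, colClasses = cc)
names(d)  # only the wanted column is read
```

read.table returns a data frame even when a single column survives, so the later rbind stacks rows rather than transposing.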
Blue Magister
  • Seems to work well, although for some reason makes my code run slower – Janice Vetter Mar 15 '13 at 17:59
  • If I run the code in the R GUI console, it looks as though it slows down at the first call of lapply. It pauses for 2-5 seconds or so. (FYI I'm using your second suggested bit of code). – Janice Vetter Mar 18 '13 at 14:02
  • And it is slower than the first bit of code? I can't tell exactly why that's the case without looking at the data, but specifying the classes of the columns (e.g. `"character"` instead of `NA`) might help. – Blue Magister Mar 18 '13 at 17:34
  • Thanks, I got a lot better results using your third suggestion, the files are over 17000 lines long by 100+ columns in some cases. I suppose in my old for loop a bit less time was lost since the console didn't print stuff all the time. Thanks again for the great advice! – Janice Vetter Mar 18 '13 at 17:47

When you take only one column, it becomes a vector. It would be better to append all the values to a vector instead of a matrix:

InVec <- c()
for (i in 1:length(filename)){
  file <- read.table(filename[i], header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
  InVec <- c(InVec, file[-1, colVal])
  #...other matrices...
}

Using c() will be much faster than rbind as well.
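
A possible variant, sketched here with made-up numbers rather than real files, is to collect the per-file vectors in a list and concatenate once with unlist, which avoids growing InVec inside the loop. The list `monthly` and its values are assumptions for illustration:

```r
# Made-up monthly vectors of different lengths (stand-ins for one column per file)
monthly <- list(jan = c(3.1, 4.7, 2.2), feb = c(5.0, 1.8))

# One concatenation instead of growing a vector inside a loop
InVec <- unlist(monthly, use.names = FALSE)

length(InVec)  # total number of values across all files
range(InVec)   # min and max
mean(InVec)
```

Files of different lengths are not a problem for a single vector, since the values are simply laid end to end.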

LostLin
  • I think I'm running into issues here trying to perform actions on these vectors. Since I'm reading in several files, all of which have different lengths, I end up with inconsistent row numbers and cannot print etc... Any thoughts? – Janice Vetter Mar 14 '13 at 17:38
  • what actions are you performing? – LostLin Mar 14 '13 at 17:43
  • some of the matrices require column min/max, others mean, and finally they all need to be written out to a csv file. The data is half-hourly and divided among files by month. So some files are 30*48, 29*48 etc... – Janice Vetter Mar 14 '13 at 18:04
  • I'm not understanding what issues you're having working with a single vector as opposed to a single column – LostLin Mar 14 '13 at 18:40
  • Sorry, the difficulty occurs because in certain situations I may need one column from my input files, or several. I have 4 or so matrices that contain similar data (temperature, precipitation, etc...) pulled from monthly records which contain all of these data. In some cases, I need 7 columns (which are distinct forms of data, temp@1m, temp@3m, etc...) in a given matrix; in another case, I may need only one column (i.e. wind speed). – Janice Vetter Mar 14 '13 at 18:46