3

I am very new to R and programming in general and need help with lining up data contained in ~2000 .xls and .xlsx files. Each file begins with a range of 34 - 40 rows of "junk" text before the header; all data under the header has the same dimensions.

The 1st method I tried added data to a list; the vertical format was not useful.

library(readxl)
file.list <- list.files(pattern='*.xls')
dm.list <- lapply(file.list, read_excel)

I am currently attempting to read in one file at a time, delete "junk" text, and write to a .csv file (appending data column-wise).

library(readxl)
file.list <- list.files(pattern='*.xls')

for(i in 1:dim.data.frame(file.list))

store.matrix <-  read_excel((paste0("C:\\Users\\jlmine\\Desktop\\qPCRextData\\", file.list[i])), sheet = "Results")

while (store.matrix[1,1] != "Well") #search for header
{  store.matrix <- store.matrix[-c(1)] } #delete non-header rows

write.csv(store.matrix, file = "qPCRdataanalysis.csv", append = TRUE)

The following line is throwing an error:

store.matrix <- read_excel((paste0("C:\\Users\\jlmine\\Desktop\\qPCRextData\\", file.list[i])), sheet = "Results")

Error: 'C:\Users\jlmine\Desktop\qPCRextData\' does not exist. In addition: Warning message: In 1:dim.data.frame(file.list) :
numerical expression has 2 elements: only the first used

"C:\Users\jlmine\Desktop\qPCRextData\" was set as my working directory Any ideas would be greatly appreciated.

Fadhly Permata
  • 1,686
  • 13
  • 26
  • Have you already tried the openxlsx package? It offers a lot of solutions for problems like yours – Ferdi Sep 22 '16 at 06:31
  • I haven't looked into that package yet...i'll check it out if I have troubles with the below solutions/suggestions...thanks for the recommendation! – Jerry Miner Sep 22 '16 at 15:42

2 Answers2

0

Without having access to your .xlsx file, the problem appears to be in your for loop statement. list.files returns a character vector of the files in the specified directory. Using dim.data.frame on a vector x with length 5 would give you the result:

#[0, 5]

And from your warning message, you know only the first element is being used in the for loop. So you aren't looping over anything.

So if you want to loop through all the files more elegantly, you'd do...

for (i in seq(length(file.list)) {
Chrisss
  • 3,211
  • 1
  • 16
  • 13
0

I can't be sure without seeing some of your data, but it looks like you could read in each file, find the row where the "real" data begins and then remove the "junk" rows. For example:

df.list = lapply(file.list, function(f) {

  # Read file
  tmp = read_excel(f, sheet="Results")

  # Find highest index of row containing "Well" and add 1 (assuming here
  # that a row containing "Well" will come before the header row).
  s = which(apply(tmp, 1, function(x) {grep("Well", x)}) > 0)
  s = ifelse(length(s) > 0, max(s) + 1, 0)

  # Reset column names to the values in row s (the actual header row)
  # Remove rows 1 through s (the "junk" text plus the header row) from the data frame
  if(s > 0) {
    names(tmp) = tmp[s, ]
    tmp[-(1:s), ]
  }

})

You'll now have df.list, a list in which each element is one of the xls/xlsx files you just loaded. You said you want to combine the data column-wise, but if each data frame has the same columns, wouldn't you want to stack the data frames. To do that, you could do:

df.list = do.call(rbind, df.list)

You now have a single data frame, which you can save as a csv file.

eipi10
  • 91,525
  • 24
  • 209
  • 285