2

I m looking for a better way to open all xlsx and csv files of a repository at a time and then merge them. Here is what I have done, but as you will see there are some manual actions that I m trying to automate. Can you please help me?

  1. Define the Working directory

    setwd("~/Desktop/repo/")
    list.files()
    
    [1] "dataset1.csv"      
    [2] "dataset2.csv" 
    [3] "dataset3.csv"    
    [4] "dataset4.csv"        
    [5] "export.xlsx"                              
    [6] "export1.xlsx"                           
    [7] "export2.xlsx"                           
    
  2. Open all CSV files (this part seems OK to me, as each dataset opens automatically)

    file.list <- list.files(pattern='*.csv')
    csv_df <- lapply(file.list, read.csv, header=TRUE)
    csv_df <- do.call("rbind", csv_df) #open all csv in a single dataframe
    
  3. Open all XLSX files (here comes the most ugly manually part)

    df1 <- lapply(excel_sheets("export.xlsx"), read_excel, path = "export.xlsx")
    df2 <- lapply(excel_sheets("export1.xlsx"), read_excel, path = "export(1).xlsx")
    df3 <- lapply(excel_sheets("export2.xlsx"), read_excel, path = "export(2).xlsx")
    df1 <- do.call("rbind", df1)
    df2 <- do.call("rbind", df2)
    df3 <- do.call("rbind", df3)
    xlsx_df <- rbind(df1, df2, df3) #merge all xlsx
    
  4. Merge everything in a big dataset

    big_df <- rbind(csv_df, xlsx_df)
    

Any suggestions ? How can I improve this script (specially the 3. part)?

Remi
  • 961
  • 1
  • 13
  • 25
  • You could use the `rio` package which leverages many other packages to provide a broad range of file ingest from a single function. Then it would resolve down to a single `lapply` and `do.call(rbind, x)`, or a `map_df` if you want to use `purrr` – Jake Kaupp Nov 13 '17 at 21:26
  • merge (column bind) != append (row bind) – Parfait Nov 13 '17 at 22:59

2 Answers2

0

For reading in Excel files other than .csv files, I recommend package XLConnect, which can read in all kinds of Excel files (.xlsx, .xls,.xlsm). I made some minor edits to your 2nd part and this should work (assuming all your column names are the same in each .xlsx file):

library(XLConnect)

list.files()
file.list <- list.files(pattern='*.xlsx')
xlsx_df <- lapply(file.list, readWorksheetFromFile, header = T)
xlsx_df <- do.call("rbind", xlsx_df)

If you cannot install the XLConnect package, try to install package openxlsx which does not require java. The code would be like this:

library(openxlsx)

list.files()
file.list <- list.files(pattern='*.xlsx')
xlsx_df <- lapply(file.list, read.xlsx, header = T)
xlsx_df <- do.call("rbind", xlsx_df)

Edits: To read in all sheets from a Excel file: I borrowed a function created in this post and make it a openxlsx version:

library(openxlsx)    
read_excel_allsheets <- function(filename) {
  sheets <- names(loadWorkbook(filename))
  x <- lapply(sheets, function(X) readWorkbook(filename, sheet = X))
  names(x) <- sheets
  x
}

data <- read_excel_allsheets("file.xlsx")
Davidx
  • 61
  • 8
  • Thanks for your reply. It seems there is an issue with the XLConnect package. I can't use it on my laptop (R Studio Version 1.0.143 / R 3.4.0). I have this error message : "Error: package or namespace load failed for ‘XLConnectJars". I also tried to manuall install.packages("XLConnectJars") and install.packages("rJava") but the issue is still there. Do you have the issue too? – Remi Nov 13 '17 at 22:14
  • @Remi I guess you are using Mac OSX system. Here are two links you may find helpful. 1. https://stackoverflow.com/questions/15303967/installing-r-library-xlconnect-on-os-10-8-2 2. https://cran.r-project.org/web/packages/XLConnect/README.html – Davidx Nov 14 '17 at 14:47
  • Thanks Davidx, checked both 2 links and followed instructions but I still have the issue.... and error message "Error: package or namespace load failed for ‘XLConnectJars’:" :-/ – Remi Nov 14 '17 at 16:22
  • @Remi See my edits to the answer above. I think the general logic is the same. It comes down to which function you use in the `lapply` function to read in your excel file. – Davidx Nov 14 '17 at 16:41
  • I tryied too but library(xlsx) do need rJava.... When loading xlsx, I got this message : Error: package or namespace load failed for ‘rJava’: – Remi Nov 15 '17 at 10:10
  • Is there an other way to bulk open XLSX files ? (without using XLConnect or XLSX libraries) ? – Remi Nov 15 '17 at 13:34
  • @Remi try the package `openxlsx` and the function is also `read.xlsx` – Davidx Nov 16 '17 at 14:43
  • Thank you very much @Davidx, it works... almost :-) The only thing that last is : how to open all sheets inside each excel file? I was using excel_sheets() in my previous script, but I don't really see how to include it in your update. Do you have any idea? – Remi Nov 16 '17 at 16:13
  • @Remi I am now sure what exactly are you asking for. Maybe you can start another question. If you want to read all the sheet in the your file, you can write a `for` loop to read one sheet at a time. With in the `read.xlsx`, you can set `sheet = 1` or `sheet = Book1`. – Davidx Nov 16 '17 at 18:54
  • Yes, the goal is to read all the sheets, in each xlsx file. Do you know how to create the for loop please ? – Remi Nov 16 '17 at 19:07
  • @Remi I did some research and found this function. See my edits above. If this solves your problem, please mark my post as answer. Thank you. – Davidx Nov 16 '17 at 19:37
0

This is a quite old post but might be still relevant for some folks.

Hello, I had somewhat a similar problem, your "rbinding" of the csv files seems ok, but I would like to suggest a function to bind your excel files (step 2 of your post), for this you will require the readxl package, its a very basic package to import xls and xlsx files:

library(readxl)
rbind.all.excel<-function(dir,skip){
  file_list <- list.files(dir)
  df <- do.call("rbind",lapply(file_list,FUN=function(files){
    read_excel(files,skip=skip,col_types = "guess")}))
  return(df)
}

Then:

dir<-c("your/target/path")
big.xl<-rbind.all.excel(dir=dir,skip=0)

And you are done with the excel files