11

I have multiple .xls (~100MB) files from which I would like to load multiple sheets (from each) into R as a dataframe. I have tried various functions, such as xlsx::xlsx2 and XLConnect::readWorksheetFromFile, both of which always run for a very long time (>15 mins) and never finish and I have to force-quit RStudio to keep working.

I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook can.

The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?

In several places, I have seen a recommendation to use the function readxl::read_xls, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:

> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error: 
  filepath: /Users/USER/Desktop/test_file.xls
  libxls error: Unable to open file

I also did some elementary testing to make sure the file exists and is in the correct format:

> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"

The test_file.xls used above is available here. Any advice would be appreciated in terms of making the first functions run faster or the read_xls run at all - thank you!

UPDATE (12/14/2019):

It seems that some users are able to open the file above using the readxl::read_xls function, while others are not, both on Mac and Windows, using the most up to date versions of R, Rstudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.

UPDATE (4/14/2023):

The above GitHub issue became a combination of various problems and yielded one possible solution, which was opening and closing the file in MS Excel first, which subsequently made it readable by readxl::read_xls. Understandably, though, this is not the best solution. The above issue was closed without a better resolution and instead, a related issue was opened in the libxls GitHub, so far without a resolution.

Brunox13
  • 775
  • 1
  • 7
  • 21

8 Answers8

5

I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):

install.packages("readxl")
library(readxl)
library(data.table)

dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))

Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):

enter image description here

Just as well, you can use the read_xls function instead read_excel.

I checked, it also works correctly and even a little faster, since read_excel is a wrapper over read_xls and read_xlsx functions from readxl package.

Also, you can use excel_sheets function from readxl package to read all sheets of your Excel file.

UPDATE

Benchmarking is done with microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.

But XLConnect it's a Java-based solution, so it requires a lot of RAM.

enter image description here

red_quark
  • 971
  • 5
  • 20
  • Thanks for the reply! Unfortunately, this does not advance my problem in any way - when I try running your code, I still get the same error as described in the original question: `libxls error: Unable to open file` – Brunox13 Dec 07 '19 at 21:01
  • 2
    Try uninstall `readxl` package, restart the RStudio and reinstalling the `readxl`. Also try updating R and RStudio to the latest versions (if you have outdated versions). – red_quark Dec 07 '19 at 22:39
  • As a possible option, you can open your Excel file, save it with the `.xlsx` extension and then read it with the same way (using function `read_excel` from the `readxl` package). – red_quark Dec 07 '19 at 22:50
  • Good idea - but reinstalling the `readxl` package did not help. If possible, I'd like to open the file as is, all in R, without using Excel first. I have submitted a bug report on [readxl's github](https://github.com/tidyverse/readxl/issues/598), so we'll see if they have a better suggestion! – Brunox13 Dec 09 '19 at 01:12
  • 1
    Oh, that’s very interesting. I hope they help you find a solution to the issue. Simultaneously, I have two more recommendations, that may possibly help you: 1. Try directly install the `Rcpp` package, which in turn requires `Xcode` (for OSX), which are dependencies external to R. – red_quark Dec 09 '19 at 02:45
  • 2
    2. Try open your excel file with older version of `readxl` package (for example, v.1.3.0, 1.2.0, 1.1.0., etc.; all archived versions are available here: https://cran.r-project.org/src/contrib/Archive/readxl/). You can install it in this way (for example): `if(!require(readxl)) { install.packages("https://cran.r-project.org/src/contrib/Archive/readxl/readxl_1.2.0.tar.gz", repos = NULL, type="source"); suppressPackageStartupMessages(require(readxl))}` P.S. in the link you can specify any version of the archive package – red_quark Dec 09 '19 at 02:46
  • Thanks, these are some resourceful ideas! Unfortunately, they have not helped - I already have both Rcpp and Xcode; following your advice, I tried previous versions of the `readxl` package and I'm still getting the same error; the only difference is that v1.1.0's error is `Error in read_fun(path = path, sheet_i = sheet, limits = limits, shim = shim, : Failed to open test_file.xls` but I am assuming it is essentially the same thing. – Brunox13 Dec 09 '19 at 20:45
  • Please, how does the speed of `gdata::read.xls`, `XLConnect::readWorksheetFromFile`, and `readxl::read_xls` compare on your computer for this file? And did you alter (or open in Excel) the file before successfully loading it with the `readxl` package? – Brunox13 Dec 09 '19 at 21:20
  • I did benchmarking for you (according to the functions that you specified, times = 1L). As you can see, the `readxl::read_excel` is much faster than others. I updated my answer to be able to demonstrate benchmarking. – red_quark Dec 09 '19 at 23:35
  • 1
    Of course, no. I didn't alter or open excel file before loading it with the `readxl` package. I run the code on Windows 10 Enterprise (RAM 16 GB, 64-bit). But! When I try to read your excel file with RStudio Server (Ubuntu), I get same error: `libxls error: Unable to open file`. – red_quark Dec 09 '19 at 23:36
  • Thank you so much for doing all of that!! I'll repost your comment on the github. – Brunox13 Dec 10 '19 at 02:45
  • I just tried the `readxl::read_excel` function on a Windows computer and ended with the exact same error - `libxls error: Unable to open file`. What is the setup on the computer on which this function works with this file? Windows, R, RStudio, readxl versions, etc.? – Brunox13 Dec 11 '19 at 18:28
  • 1
    Oh, very strange( I have `> sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)` (Windows: Windows 10 Enterprise, vers. 1803, OS build 17134.1006. Proc.: Intel Core i7-8550U, 64-bit OS). RStudio - Version 1.1.456 (July 19th, 2018) – red_quark Dec 13 '19 at 11:48
1

I found that I was unable to open the file with read_xl immediately after downloading it, but if I opened the file in Excel, saved it, and closed it again, then read_xl was able to open it without issue.

My suggested workaround for handling hundreds of files is to build a little C# command line utility that opens, saves, and closes an Excel file. Source code is below, the utility can be compiled with visual studio community edition.

using System.IO;
using Excel = Microsoft.Office.Interop.Excel;

namespace resaver
{
  class Program
  {
    static void Main(string[] args)
    {
      string srcFile = Path.GetFullPath(args[0]);
      Excel.Application excelApplication = new Excel.Application();
      excelApplication.Application.DisplayAlerts = false;
      Excel.Workbook srcworkBook = excelApplication.Workbooks.Open(srcFile);
      srcworkBook.Save();
      srcworkBook.Close();
      excelApplication.Quit();
    }
  }
}

Once compiled, the utility can be called from R using e.g. system2().

mikeck
  • 3,534
  • 1
  • 26
  • 39
1

On my system, i had to use path.expand.

R> file = "~/blah.xls"
R> read_xls(file)

Error: 
  filepath: ~/Dropbox/signal/aud/rba/balsheet/data/a03.xls
  libxls error: Unable to open file

R> read_xls(path.expand(file)) # fixed
ricardo
  • 8,195
  • 7
  • 47
  • 69
0

I will propose a different workflow. If you happen to have LibreOffice installed, then you can convert your excel files to csv programatically. I have Linux, so I do it in bash, but I'm sure it can be possible in macOS.

So open a terminal and navigate to the folder with your excel files and run in terminal:

for i in *.xls
    do soffice --headless --convert-to csv "$i" 
done

Now in R you can use data.table::fread to read your files with a loop:

Scenario 1: the structure of files is different

If the structure of files is different, then you wouldn't want to rbind them together. You could run in R:

files <- dir("path/to/files", pattern = ".csv")
all_files <- list()
for (i in 1:length(files)){
  fileName <- gsub("(^.*/)(.*)(.csv$)", "\\2", files[i])
  all_files[[fileName]] <- fread(files[i])
}

If you want to extract your named elements within the list into the global environment, so that they can be converted into objects, you can use list2env:

list2env(all_files, envir = .GlobalEnv)

Please be aware of two things: First, in the gsub call, the direction of the slash. And second, list2env may overwrite objects in your Global Environment if they have the same name as the named elements within the list.

Scenario 2: the structure of files is the same

In that case it's likely you want to rbind them all together. You could run in R:

files <- dir("path/to/files", pattern = ".csv")
joined <- list()
for (i in 1:length(files)){
  joined <- rbindlist(joined, fread(files[i]), fill = TRUE)
}
PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • 1
    Thanks for the reply! I would prefer a solution that is entirely done in R, as I am developing this workflow for a friend who is not as well-versed in coding, so I'd like to hand off a single working script to her, instead of having to go to terminal. – Brunox13 Dec 07 '19 at 21:09
0

I was seeing a similar error and wanted to share a short-term solution.

library(readxl)
download.file("https://mjwebster.github.io/DataJ/spreadsheets/MLBpayrolls.xls", "MLBPayrolls.xls")
MLBpayrolls <- read_excel("MLBpayrolls.xls", sheet = "MLB Payrolls", na = "n/a")

Yields (on some systems in my classroom but not others):

Error: filepath: MLBPayrolls.xls libxls error: Unable to open file

The temporary solution was to paste the URL of the xls file into Firefox and download it via the browser. Once this was done we could run the read_excel line without error.

This was happening today on Windows 10, with R 3.6.2 and R Studio 1.2.5033.

UseR10085
  • 7,120
  • 3
  • 24
  • 54
  • Thanks, @Ryan! Like I mentioned on GitHub, the issue described in the OP was with files already downloaded locally, so this is likely a different matter. – Brunox13 Jan 31 '20 at 19:25
0

Resaving your file and you can solve your problem easily.

I also find this problem before but I get the answer from your discussion.

I used the read_excel() to open those files.

UseR10085
  • 7,120
  • 3
  • 24
  • 54
花落思量错
  • 352
  • 1
  • 11
  • Do you mean resaving the file with Excel? Yes, this has already been suggested, but it does not solve the problem when working with hunderds of files (you can imagine that opening & resaving each file in Microsoft Excel would take a very long time). – Brunox13 Aug 06 '20 at 16:01
  • And it makes no difference whether `read_xls()` or `read_excel()` is used - as per the [`read_excel()` documentation](https://www.rdocumentation.org/packages/readxl/versions/1.3.1/topics/read_excel), `read_xls()` is actually better: "`read_excel()` calls `excel_format()` to determine if path is xls or xlsx, based on the file extension and the file itself, in that order. Use `read_xls()` and `read_xlsx()` directly if you know better and want to prevent such guessing." – Brunox13 Aug 06 '20 at 16:05
0

If you have downloaded the .xls data from the internet, even if you are opening it in Ms.Excel, it will open a prompt first asking to confirm if you trust the source, see below screenshot, I am guessing this is the reason R (read_xls) also can't open it, as it's considered unsafe. Save it as .xlsx file and then use read_xlsx() or read_excel(). enter image description here

ahmadzai
  • 44
  • 5
0

Even thought this is not a code-based solution, I just changed the type file. For instance, instead of xls I saved as csv or xlsx. Then I opened it as regular one.
I worked it for me, because when I opened my xlsfile, I popped up the message: "The file format and extension of 'file.xls'' don't match. The file could be corrupted or unsafe..."