R readxl returning NA values from Excel file

Question

I am having trouble reading data from *.xls files into R. I'm trying to use readxl::read_xls() to read the data in the Microsoft Excel file from following URL: https://www.misoenergy.org/Library/Repository/Market%20Reports/20171114_5min_exante_lmp.xls. I am on R version 3.4.1 (Single Candle) and the output from sessionInfo() is pasted at the very bottom of this post.

The file has 6 sheets that contain data. As a minimal example, consider reading the second sheet, named RT Ex-Ante 5 Minute LMPs(1). The code below was my first attempt at reading this sheet:

library(readxl)
fpath <- '/Users/bmosovsky/Downloads/20171114_5min_exante_lmp.xls'
data <- read_excel( path=fpath, sheet=2, col_names=FALSE )

This allows read_excel to guess the range of data to read and the column types. I get the the warning message,

Warning message:
In read_fun(path = path, sheet = sheet, limits = limits, shim = shim,  :
  Expecting logical in B65535 / R65535C2: got 'IPL.CC.IPLEV01'

and str(data) returns

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   65535 obs. of  6 variables:
 $ X__1: POSIXct, format: "2017-11-13 04:35:00" "2017-11-13 04:35:00" "2017-11-13 04:35:00" "2017-11-13 04:35:00" ...
 $ X__2: logi  NA NA NA NA NA NA ...
 $ X__3: logi  NA NA NA NA NA NA ...
 $ X__4: logi  NA NA NA NA NA NA ...
 $ X__5: logi  NA NA NA NA NA NA ...
 $ X__6: logi  NA NA NA NA NA NA ...

Thinking that perhaps read_excel() was just incorrectly guessing the column types, I then tried:

data1 <- read_excel( path=fpath, sheet=2, col_names=FALSE, 
                    col_types=c('text', 'text', 'numeric', 'numeric', 'numeric', 'numeric') )

This got rid of the warning, as the columns were correctly typed, but I'm still getting NA values for all columns except the first. This time str(data1) returned

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   65535 obs. of  6 variables:
 $ X__1: chr  "43052.2" "43052.2" "43052.2" "43052.2" ...
 $ X__2: chr  NA NA NA NA ...
 $ X__3: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__4: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__5: num  NA NA NA NA NA NA NA NA NA NA ...
 $ X__6: num  NA NA NA NA NA NA NA NA NA NA ...

Finally, I tried pasting the first 10 rows of data (formats and all) from the second sheet of the Excel file into a fresh Excel workbook, saving as test.xls, and trying the following:

fpath_test <- '/Users/bmosovsky/Downloads/test.xls'
data_test <- read_excel( path=fpath_test, sheet=1, col_names=FALSE,
                         col_types=c('text', 'text', 'numeric', 'numeric', 'numeric', 'numeric') )

Now str(data_test) returns the correct result:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   10 obs. of  6 variables:
 $ X__1: chr  "43052.2" "43052.2" "43052.2" "43052.2" ...
 $ X__2: chr  "CIN.MARKLND.3" "CIN.MIAMWAB.1" "CIN.MIAMWAB.2" "CIN.MIAMWAB.3" ...
 $ X__3: num  22.4 22.6 22.6 22.6 22.5 ...
 $ X__4: num  21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6 21.6
 $ X__5: num  0.8 1.02 1.02 1.02 0.92 0.93 1.29 1.29 1.29 0.06
 $ X__6: num  0.04 0.01 0.01 0.01 0.01 0.01 0.05 0.05 0.05 0.06

So, my question is, what is unique to the downloaded Excel file that won't allow the data to be read into R properly? I am trying to read in this data as part of an automated data harvesting process, so any kind of manual manipulation of the Excel file is not possible as a workaround. Can anyone provide some insight as to how I can get data from all the sheets of the .xls file into R for processing?

Here is the output from sessionInfo():

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2      rvest_0.3.2       xml2_1.1.1        RPostgreSQL_0.6-2 DBI_0.7-12        lubridate_1.6.0   dplyr_0.7.2       readxl_1.0.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12.3   tidyr_0.6.3      assertthat_0.2.0 cellranger_1.1.0 R6_2.2.2         magrittr_1.5     httr_1.2.1       rlang_0.1.1      stringi_1.1.5   
[10] curl_2.8.1       stringr_1.2.0    glue_1.1.1       compiler_3.4.1   pkgconfig_2.0.1  bindr_0.1        tibble_1.3.3

That must be an oddity with the specific XLS file. I see the same symptom, but when I make one minor change to an unimportant cell and save (in the same "xls" format), everything reads just fine. — r2evans, Nov 15 '17 at 20:25
That's what I was afraid of. Other similar files from the website appear to be plagued with the same issue. Why MISO would provide data files in XLS format in the first place is beyond me, but now I need to figure out a way to wrangle this data into R for processing/normalization. Any suggestions to this effect would be greatly appreciated, even "hacky" ones. — bmosov01, Nov 15 '17 at 20:36
Okay figured it out! They merged cells in the first 3 rows. And even using the option of skip=3 was throwing read_excel off. 2 Hacky ways to do this. 1. Remove the first three rows manually, save the file and use read_excel. 2. Save the file as xlsx without removing anything and use this `foo <-read_excel(path = "~/Downloads/20171114_5min_exante_lmp.xlsx",skip=3, col_names = T, trim_ws = T,col_types = c("date","text",rep("guess",4)), sheet = 1)` — infominer, Nov 15 '17 at 20:41
I didn't un-merge any cells, and it is reading fine for me, so I suspect the "mitigation" is have excel re-save it, for whatever reason. This might be automate-able using something like `RDCOMClient` (with which I have zero experience, sorry). — r2evans, Nov 15 '17 at 20:42
I do think this could be presented as [a new issue](https://github.com/tidyverse/readxl/issues), and though the issue might be more desirably received if you can determine the property that is buggering the system, it's possible -- even likely -- that one of the coders (top 3 being hadley, jennybc, and jeroen) could have better sleuthing tools given the specific example file. — r2evans, Nov 15 '17 at 20:53
Good catch on the merged cells @infominer. This is a troubling issue. From browsing the readxl github page there appear to be references to two promising Excel packages: [tidyxl](https://github.com/nacnudus/tidyxl#readme) and [rexel](https://github.com/rsheets/rexcel#readme). However, both are for `xlsx` files only, and mine are the old binary `xls` files. I don't have MS Excel installed on the machine where the data harvest will occur, so I can't re-save the files. Manual solutions won't work either. So, barring more creative solutions, it appears I may be out of luck here? — bmosov01, Nov 15 '17 at 21:30
Actually, I'm not sure the merged cells is the actual issue. There are no merged cells on sheet 2, but the issue is still present. — bmosov01, Nov 15 '17 at 21:44
Hmm! you're right. BTW I saved it as xlsx not manually but using this on an Ubuntu OS at the commandline `libreoffice --headless --convert-to xlsx 20171114_5min_exante_lmp.xls` Now you can read this into R using `foo<-read_excel(path="~/Downloads/20171114_5min_exante_lmp.xlsx",sheet = 1)`. Also look up xls2csv and xlsx2csv, that you can use at the commandline UPDATE:After converting it to xlsx using my libreoffice command, I had no problems reading all sheets. As you might already know, headers only exist on sheet 1. — infominer, Nov 15 '17 at 22:28

R readxl returning NA values from Excel file

0 Answers0