read_excel won't trim whitespace

Question

I am using the package readxl to load an excel file. As default it should strip the white space however it is not doing so.

The file can be downloaded directly from the link below or alternatively it can be downloaded through the website where it is Appendix B

http://www2.nationalgrid.com/UK/Industry-information/Future-of-Energy/Electricity-Ten-Year-Statement/

http://www2.nationalgrid.com/WorkArea/DownloadAsset.aspx?id=8589937799

require(readxl);require(tidyverse)
test <- read_excel("ETYS 2016 Appendix B.xlsx", skip = 1, sheet = 22, trim_ws = TRUE)
print(test$`MVAr Generation`)
test$`MVAr Generation` %>% str_count(patter = "\\s")

test$`MVAr Generation` %>% table #all are numeric
test$`MVAr Generation` %>% class #however the class is characer

test$`MVAr Generation` %>% str_count(patter = "\\s") %>%
sum(na.rm = T) #It should be 0 however it is 2

This problem is causing problems in the analysis as can be seen by this example in which the numeric column is a character. Help would be appreciated

score 3 · Answer 1 · answered May 23 '17 at 17:55

library(readxl)

readxl::excel_sheets('ETYS 2016 Appendix B.xlsx')[22]
test <- read_excel("ETYS 2016 Appendix B.xlsx", skip = 1, sheet = 22, 
                   trim_ws = FALSE)
test$`MVAr Generation` <- as.numeric(gsub('^\\s', "", test$`MVAr Generation`))

The error is probably due to character encoding. I get this error when I forced numeric interpretation of the column:

Expecting numeric in D9 / R9C4: got 'Â 225'

You can manually avoid this by substituting leading spaces with gsub.

Good idea with the unrecognised symbols however, this deletes the numbers with the whitespace as well, so doesn't work. — Jonno Bourne, May 25 '17 at 11:10

jay.sf · Answer 2 · 2017-05-23T19:11:09.297

Maybe this is what you want:

library(xlsx)
test <- read.xlsx("ETYS 2016 Appendix B.xlsx", sheetName = 22, 
              colIndex = 1:7, startRow = 2, header = TRUE, 
              stringsAsFactors = FALSE)

# remove whitespace
test <- data.frame(lapply(test, function(y) {
           y <- gsub("^\\s+", "", y); 
           y <- gsub("Â", "", y); y
           y <- gsub("^\\s+", "", y); 
           }))

# set tidy cols to numeric
cols = c(3, 4, 5, 7)
test[,cols] = apply(test[,cols], 2, function(x) as.numeric(x))

# test
class(test$Unit.Number)
test$MVAr.Absorption

score 1 · Accepted Answer · answered May 25 '17 at 11:15

The insight of @troh with the character encoding got me to think about using regex. @jaySF 's application across the whole dataframe was a good way to process all the columns at same time. The two suggestions lead me to the below answer.

require(dplyr);require(purrr);require(readr)
RemoveSymbols <-function(df)  {
  df  %>% mutate_all( funs(gsub("^[^A-Z0-9]", "", ., ignore.case = FALSE))) %>%
     map_df(parse_guess) 
}

test2 <- RemoveSymbols(test)

sapply(test2,class)

read_excel won't trim whitespace

3 Answers3