
My application reads .xls and .xlsx files using the `read_excel` function from the readxl package.

The sequence and the exact number of columns are not known in advance of reading the .xls or .xlsx file. There are 15 predefined columns, of which 10 are mandatory and the remaining 5 are optional. So the file will always have a minimum of 10 columns and a maximum of 15 columns.

I need to specify the `col_types` for the 10 mandatory columns. The only way I can think of is to specify the `col_types` by column name, since I know for a fact that the file contains all 10 mandatory columns, but in a random order.

I searched for a way to do this but couldn't find one.

Can anyone help me find a way to assign the col_types by column names?

Abhinandan Satpute
  • The short answer is no; it's not like `readr::cols`. If the structure is very regular you might be able to fake it, but that doesn't sound like what you have. A simpler option is just to read everything in and subset. If types are causing problems, read everything as character and use `readr::type_convert` or `lapply` with `type.convert` to clean up. – alistaire May 16 '17 at 07:12
  • Yeah, I thought about the option you mentioned. I just wanted to avoid the overhead of type conversion after reading the file. But as you said, if there is no other option, I might need to go with it. – Abhinandan Satpute May 16 '17 at 07:21
  • Honestly, with 15 variables the time type conversion takes will be negligible until your data is millions of rows. – alistaire May 16 '17 at 07:27
  • @alistaire agree. – Abhinandan Satpute May 16 '17 at 07:30
  • @alistaire As I don't know the exact number of columns, I won't be able to specify the `col_types`. Any thoughts on it? – Abhinandan Satpute May 16 '17 at 08:49
  • readr's `col_types` parameters can take a `cols` (or `cols_only` if you want to exclude excess variables) call, which can take column names attached to a specified parser. Otherwise, just stop worrying, read everything in, and clean it up after you've got it; trying to fix too much on import will take much more time than just fixing it after the fact with the typical `as.integer` and such. – alistaire May 16 '17 at 08:53
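The read-everything-as-text approach discussed in the comments can be sketched as follows (assuming the readxl and readr packages are installed; `"my_file.xlsx"` is a placeholder path):

```r
library(readxl)
library(readr)

# a single col_type is recycled to every column,
# so readxl never has to guess a type
site_data <- read_excel("my_file.xlsx", col_types = "text")

# let readr re-parse each character column into its
# natural type (numeric, date, logical, ...)
site_data <- type_convert(site_data)
```

This works regardless of how many of the optional columns are present, because no column names or counts need to be known up front.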

1 Answer


I solved the problem with the workaround below. It is not the best way to solve this problem, though: the Excel file is read twice, which will hurt performance when the file holds a large volume of data.

First read: building the column data type vector. Read the file once to retrieve the column information (the column names and the number of columns) and build the `column_data_types` vector, which holds the data type for every column in the file.

#reading .xlsx file
site_data_columns <- read_excel(paste(File$datapath, ".xlsx", sep = ""))

site_data_column_names <- colnames(site_data_columns)

#pre-allocate the type vector, one entry per column
column_data_types <- character(length(site_data_column_names))

for (i in seq_along(site_data_column_names)) {

    #where "date" is a column name
    if (site_data_column_names[i] == "date") {
        column_data_types[i] <- "date"

    #where "result" is a column name
    } else if (site_data_column_names[i] == "result") {
        column_data_types[i] <- "numeric"

    } else {
        column_data_types[i] <- "text"
    }
}

Second read: reading the file content. Read the Excel file again, supplying the `col_types` parameter with the `column_data_types` vector built above.

#reading .xlsx file
site_data <- read_excel(paste(File$datapath, ".xlsx", sep = ""), col_types = column_data_types)
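The loop above can also be replaced with a named lookup vector, which maps the known column names to their types and falls back to `"text"` for everything else (a sketch using base R only; `type_map` is a name I introduce here):

```r
# map the known column names to their readxl types
type_map <- c(date = "date", result = "numeric")

# indexing by name gives NA for any column not in the map
site_data_column_names <- c("date", "site", "result")
column_data_types <- unname(type_map[site_data_column_names])

# anything unmapped falls back to plain text
column_data_types[is.na(column_data_types)] <- "text"
```

This scales better if more of the 15 predefined columns need non-text types: each one becomes a single entry in `type_map` instead of another `else if` branch.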
Abhinandan Satpute