I'm importing Excel 2007 (.xlsx) tables into R 3.2.1 (patched), using package readxl 0.1.0 under Windows 7 64-bit. The tables are on the order of 25,000 rows by 200 columns.
Function read_excel() works a treat. My only problem is with its assignment of column class (datatype) to sparsely populated columns. For example, a given column may be NA for 20,000 rows and then take a character value on row 20,001. read_excel() appears to default to column type numeric when it scans the first n rows of a column and finds only NAs. The data causing the problem are characters in a column assigned numeric, and once the error limit is reached, execution halts. I actually want the data in the sparse columns, so raising the error limit isn't a solution.
I can identify the troublesome columns by reviewing the warnings thrown. And read_excel() has an option for asserting a column's datatype by setting the argument col_types, which according to the package docs is:

Either NULL to guess from the spreadsheet, or a character vector containing "blank", "numeric", "date" or "text".
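For a small sheet the usage seems clear enough; a toy call (with a made-up file name) would look something like this:

```r
library(readxl)

# Hypothetical 3-column sheet: assert each column's type explicitly, in order
dat <- read_excel("toy.xlsx", col_types = c("text", "numeric", "date"))
```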
But does this mean I have to construct a vector of length 200, populated in almost every position with "blank" and with "text" in the handful of positions corresponding to the offending columns?
There's probably a way of doing this in a couple of lines of R code: create a vector of the required length and fill it with "blank"s, maybe make another vector containing the numbers of the columns to be forced to "text", and then ... Or maybe it's possible to tell read_excel() about just the columns for which its guesses aren't as desired.
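To make the first idea concrete, this is roughly what I have in mind (the file name and the offending column indices below are just placeholders, and I'm not certain "blank" is the right value for the columns I don't want to override):

```r
library(readxl)

path <- "big_table.xlsx"        # hypothetical file
text_cols <- c(57, 131, 178)    # hypothetical positions of the sparse character columns

# Build the length-200 col_types vector: "blank" everywhere,
# then force the troublesome columns to "text"
types <- rep("blank", 200)
types[text_cols] <- "text"

dat <- read_excel(path, col_types = types)
```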
I'd appreciate any suggestions.
Thanks in advance.