
I'm dealing with a huge dataset with 500 columns and a very large number of rows, from which I can take a sizeable sample (e.g. 1 million rows).

All the columns are stored as character, although they can represent different data types: numeric, date, ... I need to build a function that, given a column as input, recognises its type, taking NA values into account as well.

For instance, given a column `col`, I recognise whether it is numeric like this:

col <- c(as.character(runif(10000)), rep('NaN', 10))
maxPercNa <- 0.10
# values that fail to parse as numeric become NA (suppress the coercion warning)
nNa <- sum(is.na(suppressWarnings(as.numeric(col))))
percNa <- nNa / length(col)
isNumeric <- percNa < maxPercNa

Similarly, I need to recognise dates, integers, ... I was thinking about using regular expressions. One challenge is that the dataset is very big, so the technique needs to be efficient.
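To extend the snippet above to other types, one option is to apply the same "share of newly introduced NAs" test to each candidate parser in turn. A rough sketch of that idea (the function name `guessType` and the single `%Y-%m-%d` date format are my own assumptions for illustration, not a settled design):

```r
guessType <- function(col, maxPercNa = 0.10) {
  # TRUE if parsing introduced NAs for less than maxPercNa of the values
  tolerable <- function(parsed) {
    mean(is.na(parsed) & !is.na(col)) < maxPercNa
  }
  num <- suppressWarnings(as.numeric(col))
  if (tolerable(num)) {
    # whole numbers only -> integer, otherwise plain numeric
    if (all(num == floor(num), na.rm = TRUE)) return("integer")
    return("numeric")
  }
  # only one date format tried here; add more formats as needed
  if (tolerable(as.Date(col, format = "%Y-%m-%d"))) return("date")
  "character"
}

guessType(c(as.character(runif(1000)), rep(NA, 10)))  # "numeric"
```

On a large dataset you would run this on the sample only, and the vectorised `as.numeric`/`as.Date` calls should be considerably cheaper than per-value regular expressions.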

If anyone comes up with a brilliant idea, it'll be really appreciated :) Thanks in advance!

Michele Usuelli
  • Use `read.table()` that has this built-in – Andrie Dec 08 '14 at 16:42
  • Hi Andrie. The data is in a huge csv that will be converted into the xdf format. The operation will be within a DataStep. We were thinking about regular expressions for each data chunk and a summary in the end. – Michele Usuelli Dec 08 '14 at 16:44
  • Great, `read.table()` was built for csv files. It's perfect for your question. – Andrie Dec 08 '14 at 16:46
  • There's also a `read.csv()` and `read.csv2()` function - which are `read.table` with parameters set for .csv files. – talat Dec 08 '14 at 17:08
  • ...and if `read.table()` seems slow, there is also `fread()` in the **data.table** package. – joran Dec 08 '14 at 17:18
  • For the "I need to recognise dates" part, see e.g. [**here**](http://stackoverflow.com/questions/18390674/automatically-detect-date-columns-when-reading-a-file-into-a-data-frame). – Henrik Dec 08 '14 at 18:26
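For what it's worth, a minimal illustration of the built-in inference the comments point at: when `colClasses` is not forced, `read.table()` (and hence `read.csv()`) already guesses column classes from the data, though dates still come through as character:

```r
txt <- "x,y,z
1,0.5,2014-12-08
2,1.5,2014-12-09"
df <- read.table(text = txt, sep = ",", header = TRUE,
                 stringsAsFactors = FALSE)
sapply(df, class)
#         x         y           z
# "integer" "numeric" "character"   <- dates still need as.Date()
```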

0 Answers