I'm dealing with a huge dataset with 500 columns and a very large number of rows, from which I can take a sizeable sample (e.g. 1 million rows).
All the columns are stored as character, although they can represent different data types: numeric, date, ... I need to build a function that, given a column as input, recognises its format, taking NA values into account as well.
For instance, given a column col, I recognise whether it is numeric in this way:
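col <- c(as.character(runif(10000)), rep('NaN', 10))
maxPercNa <- 0.10  # maximum tolerated fraction of values that are not numbers
# as.numeric() gives NA (with a warning) for unparseable strings and NaN for 'NaN';
# is.na() is TRUE for both
nNa <- sum(is.na(suppressWarnings(as.numeric(col))))
percNa <- nNa / length(col)
isNumeric <- percNa < maxPercNa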
In a similar way, I need to recognise dates, integers, ... I was thinking about using regular expressions. A challenge is that the dataset is very big, so the technique should be efficient.
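To make the regular-expression idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions of mine: the function name guessType is made up, dates are assumed to look like YYYY-MM-DD, and the strings "NA", "NaN" and "" are assumed to mark missing values. Missing values are set aside first, and a type is accepted when fewer than maxPercNa of the remaining values fail its pattern.

```r
# Hypothetical helper: guess the type of a character column with regular expressions.
# Assumptions (mine, not taken from the real data): dates are written as YYYY-MM-DD,
# and "NA", "NaN" and "" mark missing values.
guessType <- function(col, maxPercNa = 0.10) {
  # Set missing values aside so they do not count against any type
  missing <- is.na(col) | col %in% c("NA", "NaN", "")
  values <- col[!missing]
  if (length(values) == 0) return("unknown")

  # Candidate patterns, tried in order: integer must come before numeric,
  # because every integer string also matches the numeric pattern
  patterns <- c(
    integer = "^[+-]?[0-9]+$",
    numeric = "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$",
    date    = "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
  )

  for (type in names(patterns)) {
    # Fraction of non-missing values that do not match this pattern
    percBad <- mean(!grepl(patterns[[type]], values))
    if (percBad < maxPercNa) return(type)
  }
  "character"
}

guessType(c(as.character(runif(10000)), rep("NaN", 10)))
guessType(c("2020-01-31", "2021-12-01"))
```

The date pattern only checks the shape of the string, not whether the date is valid, and I have not measured how this scales. For columns with many repeated values, running grepl on unique(values) instead might reduce the work, but that would change the percentages unless the counts are weighted back in.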
If anyone comes up with a brilliant idea, it'll be really appreciated :) Thanks in advance!