
I'm dealing with a huge dataset with 500 columns and a very large number of rows, from which I can take a sizeable sample (e.g. 1 million rows).

All the columns are stored as character, although they can represent different data types: numeric, date, ... I need to build a function that, given a column as input, recognises its type, taking NA values into account as well.

For instance, given a column `col`, I recognise whether it is numeric like this:

col <- c(as.character(runif(10000)), rep('NaN', 10))
maxPercNa <- 0.10
# values that fail to parse as numeric become NA (suppress the coercion warning)
nNa <- sum(is.na(suppressWarnings(as.numeric(col))))
percNa <- nNa / length(col)
isNumeric <- percNa < maxPercNa

Similarly, I need to recognise dates, integers, ... I was thinking about using regular expressions. One challenge is that the dataset is very big, so the technique needs to be efficient.
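To extend the snippet above to other types, one option is to apply the same "share of newly introduced NAs" test to each candidate parser in turn. A rough sketch of that idea (the function name `guessType` and the single `%Y-%m-%d` date format are my own assumptions for illustration, not a settled design):

```r
guessType <- function(col, maxPercNa = 0.10) {
  # TRUE if parsing introduced NAs for less than maxPercNa of the values
  tolerable <- function(parsed) {
    mean(is.na(parsed) & !is.na(col)) < maxPercNa
  }
  num <- suppressWarnings(as.numeric(col))
  if (tolerable(num)) {
    # whole numbers only -> integer, otherwise plain numeric
    if (all(num == floor(num), na.rm = TRUE)) return("integer")
    return("numeric")
  }
  # only one date format tried here; add more formats as needed
  if (tolerable(as.Date(col, format = "%Y-%m-%d"))) return("date")
  "character"
}

guessType(c(as.character(runif(1000)), rep(NA, 10)))  # "numeric"
```

On a large dataset you would run this on the sample only, and the vectorised `as.numeric`/`as.Date` calls should be considerably cheaper than per-value regular expressions.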

If anyone comes up with a brilliant idea, it'll be really appreciated :) Thanks in advance!

Michele Usuelli
  • Use `read.table()` that has this built-in – Andrie Dec 08 '14 at 16:42
  • Hi Andrie. The data is in a huge csv that will be converted into the xdf format. The operation will be within a DataStep. We were thinking about regular expressions for each data chunk and a summary in the end. – Michele Usuelli Dec 08 '14 at 16:44
  • Great, `read.table()` was built for csv files. It's perfect for your question. – Andrie Dec 08 '14 at 16:46
  • There's also a `read.csv()` and `read.csv2()` function - which are `read.table` with parameters set for .csv files. – talat Dec 08 '14 at 17:08
  • ...and if `read.table()` seems slow, there is also `fread()` in the **data.table** package. – joran Dec 08 '14 at 17:18
  • For the "I need to recognise dates" part, see e.g. [**here**](http://stackoverflow.com/questions/18390674/automatically-detect-date-columns-when-reading-a-file-into-a-data-frame). – Henrik Dec 08 '14 at 18:26
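For what it's worth, a minimal illustration of the built-in inference the comments point at: when `colClasses` is not forced, `read.table()` (and hence `read.csv()`) already guesses column classes from the data, though dates still come through as character:

```r
txt <- "x,y,z
1,0.5,2014-12-08
2,1.5,2014-12-09"
df <- read.table(text = txt, sep = ",", header = TRUE,
                 stringsAsFactors = FALSE)
sapply(df, class)
#         x         y           z
# "integer" "numeric" "character"   <- dates still need as.Date()
```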

0 Answers