0

I have a 400 column dataframe with multiple date columns interspersed. In the representative example below I would like to achieve the following:

  1. turn factors into numeric OR character OR POSIXct
  2. turn characters to numeric/integer where sensible
  3. turn any column containing a date to POSIXct, regardless of whether it is factor or character or date

    set.seed(123)
    df1 <- data.frame(
    A = as.numeric(1:10),
    B = sample(seq(as.POSIXct('2000/01/01'), as.POSIXct('2018/01/01'), by="day"), size=10),
    C = as.numeric(sample(20:90, size = 10)),
    D = sample(c("yes", "no"), size=10, replace = TRUE),
    E = as.factor(sample(1000:2000, size = 10)),
    F = as.character(c("test","test2","test3","test4","test5","test6","test7","test8","test9","test10")),
    G = as.factor(c("test","test2","test3","test4","test5","test6","test7","test8","test9","test10")),
    H = as.character(sample(seq(as.POSIXct('2000/01/01'), as.POSIXct('2018/01/01'), by="day"), size=10)),stringsAsFactors=FALSE
    )
    df1
    A                   B  C   D    E      F      G                   H
    1   1 2005-03-06 00:00:00 87  no 1963   test   test 2002-07-27 23:00:00
    2   2 2014-03-11 00:00:00 51  no 1902  test2  test2 2007-06-17 23:00:00
    3   3 2007-05-11 23:00:00 66  no 1690  test3  test3 2007-06-11 23:00:00
    4   4 2015-11-22 00:00:00 58  no 1793  test4  test4 2006-08-20 23:00:00
    5   5 2016-12-02 00:00:00 26  no 1024  test5  test5 2002-09-27 23:00:00
    6   6 2000-10-26 00:00:00 79  no 1475  test6  test6 2002-06-30 23:00:00
    7   7 2009-06-30 23:00:00 35  no 1754  test7  test7 2004-03-11 00:00:00
    8   8 2016-01-19 00:00:00 22  no 1215  test8  test8 2008-05-17 23:00:00
    9   9 2009-11-30 00:00:00 40 yes 1315  test9  test9 2004-10-12 00:00:00
    10 10 2008-03-17 00:00:00 85 yes 1229 test10 test10 2015-06-03 23:00:00
    
    unlist(lapply(df1, class))
      A          B1          B2           C           D           E           F           G           H 
      "numeric"   "POSIXct"    "POSIXt"   "numeric" "character"    "factor" "character"    "factor" "character" 
    

So far, I've tried the following (but it doesn't preserve the POSIXct column B) or turn the character date column (column H) into POSIXct:

df1_clean <- df1 %>% mutate_all(funs(type.convert(as.character(.), as.is = TRUE)))
unlist(lapply(df1_clean, class))
      A           B           C           D           E           F            G           H 
      "integer" "character"   "integer" "character"   "integer" "character" "character" "character" 

For this small dataset I could just call on the columns and convert B and H to POSIXct with lubridate but I want it to be automatically across the dataframe.

Any help would be much appreciated! Thanks Moe

Moe
  • 131
  • 1
  • 9
  • 1
    I don't understand: you explicitly convert all columns into `character`, and then you are surprised when they lose their `POSIXct`-ness? In R (disregarding auto-coersion), classes are usually mutually exclusive: an `integer` cannot be a `character`, and a `character` cannot be a `POSIXct`, so when you say `as.character(.)`, you are voluntarily and explicitly giving up the datetime-ness of it to turn it into a string. Are you trying to do this dynamically without a priori knowledge as to which is which, or can you use `mutate_at(vars(A,C,...), funs(as.character(.)))`? – r2evans Apr 27 '18 at 00:34
  • Thanks r2evans! I'm not surprised that everything turns to character - `type.convert` is just the only way that I have come across that dynamically re-assigns classes in R. I don't know how to write it as code but one way it could work would be: i) turn character dates (e.g. column H) to `POSIXct`, ii) creat a list of vars that are not `POSIXct` and finally iii) convert all those vars that are not `POSIXct` with the `type.convert`. hope this clarifies – Moe Apr 27 '18 at 00:54
  • Notice that the help for `type.convert` does not mention `POSIXct`, so you're on your own if you use that function. Do you know ahead of time which columns are supposed to be which? Or are you trying to come up with a general function that will happily/smartly do this for you no matter what you throw at it? – r2evans Apr 27 '18 at 01:11
  • I'm happy to use a better alternative to type.convert, if you know of one. Yes, I'm after a general function that will assign correct class no matter what I throw at it. – Moe Apr 27 '18 at 01:54
  • The functions that do "automatic type determination" are generally written to read the data *into* R, not deal with an object already in the R environment. If this is coming from a CSV or similar file, might I suggest the `readr` package? It first checks the first 1000 rows (configurable) and auto-types for you. – r2evans Apr 27 '18 at 02:03

1 Answers1

0

It may not be the most elegant way - but it seems to work for me.

#install.packages("tidyverse")
#install.packages("dataCompareR")
library("tidyverse")
library("dataCompareR")



# create reproducible df
set.seed(123)
df1 <- data.frame(
  A = as.numeric(1:10),
  B = sample(seq(as.POSIXct('2000/01/01', tz = "UTC"), as.POSIXct('2018/01/01', tz = "UTC"), by="day"), size=10),
  C = as.numeric(sample(20:90, size = 10)),
  D = sample(c("yes", "no"), size=10, replace = TRUE),
  E = as.factor(sample(1000:2000, size = 10)),
  F = as.character(c("test","test2","test3","test4","test5","test6","test7","test8","test9","test10")),
  G = as.factor(c("test","test2","test3","test4","test5","test6","test7","test8","test9","test10")),
  H = as.character(sample(seq(as.POSIXct('2000/01/01', tz = "UTC"), as.POSIXct('2018/01/01', tz = "UTC"), by="day"), size=10)),stringsAsFactors=FALSE
)
df1 #look at df

unlist(lapply(df1, class)) #look at df classes


df1_clean <- df1 %>% mutate_all(funs(type.convert(as.character(.), as.is = TRUE))) #reassign classes by running type.convert (input are all variables from the df but as.character)
unlist(lapply(df1_clean, class)) #look at df classes now

#check if a column is a Date - https://stackoverflow.com/questions/18178451/is-there-a-way-to-check-if-a-column-is-a-date-in-r
tmp=sapply(df1_clean, function(x) !all(is.na(as.Date(as.character(x),format="%Y-%m-%d", tz = "UTC")))) 

# if tmp is True, change according column to as.POSIXct 
for (i in 1:ncol(df1_clean)){
  if (tmp[i] == T){
    df1_clean[,i]<- as.POSIXct(df1_clean[,i], tz = "UTC")
  }
}

df1_clean #look at df
unlist(lapply(df1_clean, class)) #look at df classes


comp <- rCompare(df1, df1_clean) #compare your dfs before and after using the dataCompareR package
summary(comp) # check summary
Moe
  • 131
  • 1
  • 9