0

I have a dataset which looks like:

mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11

I need to extract the constituent elements (day,month,year) from date of birth and put them in corresponding columns to look like:

mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978

Currently, I have it coded as a loop:

data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)

lookup <- data[,c("patientid","extId")]

# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]

dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)


dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0

# Loop starts here    
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
    split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth,'[- ]'))

    dss_individuals[i,]["dayOfBirth"] <- split_list[3]
    dss_individuals[i,]["monthOfBirth"] <- split_list[2]
    dss_individuals[i,]["yearOfBirth"] <- split_list[1]
}

This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?

Tumaini Kilimba
  • 195
  • 2
  • 15

5 Answers5

3

I compared the speed of substr, format, and use of lubridate. It seems that lubridate and format are much faster than substr, if the the variable is stored as date. However, substr would be fastest if the variable is stored as character vector. The results of a single run is shown.

x <- sample(
    seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by="day"),
    400000, replace = T)

system.time({
    y <- substr(x, 1, 4)
    m <- substr(x, 6, 7)
    d <- substr(x, 9, 10)
})
# user  system elapsed 
# 3.775   0.004   3.779 

system.time({
    y <- format(x,"%y")
    m <- format(x,"%m")
    d <- format(x,"%d")
})
# user  system elapsed 
# 1.118   0.000   1.118 

system.time({
    y <- year(x)
    m <- month(x)
    d <- day(x)
})
# user  system elapsed 
# 0.951   0.000   0.951 

x1 <- as.character(x)
system.time({
    y <- substr(x1, 1, 4)
    m <- substr(x1, 6, 7)
    d <- substr(x1, 9, 10)
})
# user  system elapsed 
# 0.082   0.000   0.082 
mt1022
  • 16,834
  • 5
  • 48
  • 71
  • You can simplify how you time an operation by wrapping it inside the system.time for example `system.time({ ys.time() y <- substr(x, 1, 4) m <- substr(x, 6, 7) d <- substr(x, 9, 10) }) ` – gowerc Jul 04 '16 at 12:52
  • Thanks. I edited the code following suggestion. ;). @C_G – mt1022 Jul 04 '16 at 13:00
  • 1
    Can you include the results in your post? – James Jul 04 '16 at 13:09
  • 1
    Sure. I added results for a single run. @James – mt1022 Jul 04 '16 at 13:18
  • @user53777 Thanks. Interesting results. You should probably put the `as.character` call into the timing too to be a fair comparison. – James Jul 04 '16 at 14:02
2

Not sure if this will solve your speed issues but here is a nicer way of doing it using dplyr and lubridate. In general when it comes to manipulating data.frames I personally recommend using either data.tables or dplyr. Data.tables is supposed to be faster but dplyr is more verbose which I personally prefer as I find it easier to pick up my code after not having read it for months.

library(dplyr)
library(lubridate)

dat <- data.frame( mother_id = c(1,2,3),
                   dateOfBirth = ymd(c( "1962-09-24" ,"1991-02-19" ,"1978-11-11"))
)


dat %>%  mutate( year  = year(dateOfBirth) , 
                 month = month(dateOfBirth),
                 day   = day(dateOfBirth)  )

Or you can use the mutate_each function to save having to write the variable name multiple times (though you get less control over the name of the output variables)

dat %>% mutate_each( funs(year , month , day) , dateOfBirth)
gowerc
  • 1,039
  • 9
  • 18
  • As shown [by a poster below](http://stackoverflow.com/a/38185203/4651564) lubridate is in fact faster than many of the other solutions suggested so far :) – gowerc Jul 04 '16 at 12:55
  • `lubridate` uses POSIXlt objects behind the scenes. These use 5x more memory than Date objects, and so potentially this could be slower when you have a lot of data. – James Jul 04 '16 at 14:13
2

Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.

1) POSIXlt Convert to "POSIXlt" class and pick off the parts.

lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)

giving:

  mother_id dateOfBirth year month day
1         1  1962-09-24 1962     9  24
2         2  1991-02-19 1991     2  19
3         3  1978-11-11 1978    11  11

2) read.table

cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-", 
  col.names = c("year", "month", "day")))

giving:

  mother_id dateOfBirth year month day
1         1  1962-09-24 1962     9  24
2         2  1991-02-19 1991     2  19
3         3  1978-11-11 1978    11  11

3) chron::month.day.year

library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))

giving:

  mother_id dateOfBirth month day year
1         1  1962-09-24     9  24 1962
2         2  1991-02-19     2  19 1991
3         3  1978-11-11    11  11 1978

Note 1: Often when year, month and day are added to data it is not really necessary and in fact they could be generated on the fly when needed using format, substr or as.POSIXlt so you might critically examine whether you actually need to do this.

Note 2: The input data frame, DF in reproducible form, was assumed to be:

Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"

DF <- read.csv(text = Lines)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

Use format once for each part:

dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
James
  • 65,548
  • 14
  • 155
  • 193
0

Check the substr function from the base package (or other functions from the nice stringr package) to extract different parts of a string. This function may assume that day, month and year are always in the same place and with the same length.

The strsplit function is vectorized so using rbind.data.frame to convert your list to a dataframe works:

do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))

Results need to be transposed in order to be used: you can do it using do.call or the t function.

lefish
  • 1