0

I am trying to sort out any multiple entries per day, by selecting the first registered entry each day, per subject ID.

I am handling a very big data set, so here only a snapshot of my data structure:

 df <- c(Contact.ID, Date.Time, Age, Gender, Attendance)

Contact.ID       Date.Time       Age   Gender   Attendance   
1   A       2012-07-06 18:54:48   37    Male         30    
2   A       2012-07-06 20:50:18   37    Male         30    
3   A       2012-08-14 20:18:44   37    Male         30   
4   B       2012-03-15 16:58:15   27  Female         40    
5   B       2012-04-18 10:57:02   27  Female         40    
6   B       2012-04-18 17:31:22   27  Female         40    
7   B       2012-04-18 18:37:00   27  Female         40    
8   C       2013-10-22 17:46:07   40    Male         5    
9   C       2013-10-27 11:21:00   40    Male         5    
10  D       2012-07-28 14:48:33   20  Female         12 

I have tried a few different things such as:

t.first <- df[match(unique(df$Date.Time), df$Date.Time),]

setDT(df)[,.SD[which.max(df$Date.Time)],keyby=df$Contact.ID]

library(dplyr)
t.first <- ddply(df, "Date.Time", function(z) tail(z,1))

But none of them get me the first entry given my specific subject ID.

So what I need to be left with at the end is a data set such that:

Contact.ID       Date.Time       Age   Gender   Attendance   
1   A       2012-07-06 18:54:48   37    Male         29    
2   A       2012-08-14 20:18:44   37    Male         29   
3   B       2012-03-15 16:58:15   27  Female         38    
4   B       2012-04-18 10:57:02   27  Female         38    
5   C       2013-10-22 17:46:07   40    Male         5    
6   C       2013-10-27 11:21:00   40    Male         5    
7   D       2012-07-28 14:48:33   20  Female         12 

Please, if anyone could help, I have been stuck on this for way too long.

Chava Geldzahler
  • 3,605
  • 1
  • 18
  • 32
Fee
  • 89
  • 1
  • 9

2 Answers2

2

A solution from dplyr and lubridate. We can convert Date.Time to datetime class, create a new variable called date, grouping by Contact.ID and Date and select the minimum record of each group. dt2 is the final output.

library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Date.Time = ymd_hms(Date.Time)) %>%
  mutate(Date = as.Date(Date.Time)) %>%
  group_by(Contact.ID, Date) %>%
  filter(Date.Time == min(Date.Time)) %>%
  ungroup() %>%
  select(-Date)

dt2
# A tibble: 7 x 5
  Contact.ID           Date.Time   Age Gender Attendance
       <chr>              <dttm> <int>  <chr>      <int>
1          A 2012-07-06 18:54:48    37   Male         30
2          A 2012-08-14 20:18:44    37   Male         30
3          B 2012-03-15 16:58:15    27 Female         40
4          B 2012-04-18 10:57:02    27 Female         40
5          C 2013-10-22 17:46:07    40   Male          5
6          C 2013-10-27 11:21:00    40   Male          5
7          D 2012-07-28 14:48:33    20 Female         12

Data Preparation

dt <- read.table(text = "'Contact.ID' 'Date.Time' Age Gender Attendance
1 A '2012-07-06 18:54:48' 37 Male 30
                 2 A '2012-07-06 20:50:18' 37 Male 30
                 3 A '2012-08-14 20:18:44' 37 Male 30
                 4 B '2012-03-15 16:58:15' 27 Female 40
                 5 B '2012-04-18 10:57:02' 27 Female 40
                 6 B '2012-04-18 17:31:22' 27 Female 40
                 7 B '2012-04-18 18:37:00' 27 Female 40
                 8 C '2013-10-22 17:46:07' 40 Male 5
                 9 C '2013-10-27 11:21:00' 40 Male 5
                 10 D '2012-07-28 14:48:33' 20 Female 12",
                 header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
  • Will return multiple rows if there are > 1 record with the same Date.Time and that Date.Time also happens to be the earliest. Remote possibility but just putting it out there. – Josh Gilfillan Aug 15 '17 at 07:26
  • Hey! Thanks a lot for your reply! Unfortunately, I get this error message: Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "function" but the other comment worked. Thanks a lot anyway :) – Fee Aug 15 '17 at 22:31
  • I believe the error message means you apply the `mutate` function to another function, but `mutate` should be applied to a data frame. Since `df` is a pre-defined function in R, it is likely that you apply my code before define your data frame `df`. This is why I usually name my data frame as `dt`, not `df`, to avoid any confusion. – www Aug 16 '17 at 01:23
  • To prove my point, you can run `library(dplyr) mutate(df)` in a fresh new R working environment. The console should report the same error message. It doesn't matter which answer you accepted, I just want to be clear that my answer should be valid. – www Aug 16 '17 at 01:25
  • Thank you very much for clarifying this :) – Fee Aug 19 '17 at 23:50
2

Another option using dplyr::slice(). This will prevent duplicates.

library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Date.Time = ymd_hms(Date.Time)) %>%
  mutate(Date = as.Date(Date.Time)) %>%
  group_by(Contact.ID, Date) %>%
  arrange(Date.Time) %>%
  slice(1) %>%
  ungroup() %>%
  select(-Date)
Josh Gilfillan
  • 4,348
  • 2
  • 24
  • 26