0

I'm relatively new to R, so I realise this type of question is asked often but I've read a lot of stack overflow posts and still can't quite get something to work on my data.

I have data on spss, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.

My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.

As it looks on spss :

IDC neighborhood start_date
1 22 08.07.2001
1 44 04.02.2005
1 13 21.06.2010
2 44 24.12.2014
2 3 06.03.2002
3 22 04.01.2006
4 13 08.07.2001
4 2 15.06.2011

In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...

Here are all my attempts so far to select the last date. All attempts ran, there was no error. The output just didn't give me what I want. To my perception, none of these made any change in the number of repetitions of IDC, so none of them actually selected *only the last date.

##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),]  # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)

###### attempt 2--- not working 
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
  temp<-subset(d1, IDC==myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp<-temp[dim(temp)[1],]
  }
  else {
    last.temp<-temp
  }
  last<-rbind(last, last.temp)
}

last

##### atempt 3 -- doesn't work

do.call("rbind",
        by(d1,INDICES = d1$IDC,
      FUN=function(DF)
      DF[which.max(DF$start_date),]))

#### attempt 4 -- doesnt work 
library(plyr)
ddply(d1,.(IDC), function(X)
  X[which.max(X$start_date),])

### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)

My goal dataset d1 would look like this:

IDC neighborhood start_date
1 13 21.06.2010
2 44 24.12.2014
3 22 04.01.2006
4 2 15.06.2011

I'm grateful for any help, many thanks <3

Amanda
  • 3
  • 1

1 Answers1

1

There are some problems with most approaches dealing with this data: because your dates are arbitrary strings in a format that does not sort correctly, it just-so-happens to work here because the maximum day-of-month also happens in the maximum year.

It would generally be better to work with that column as a Date object in R, so that comparisons can be better.

dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
#   IDC neighborhood start_date
# 1   1           22 2001-07-08
# 2   1           44 2005-02-04
# 3   1           13 2010-06-21
# 4   2           44 2014-12-24
# 5   2            3 2002-03-06
# 6   3           22 2006-01-04
# 7   4           13 2001-07-08
# 8   4            2 2011-06-15

From here, things are a bit simpler:

Base R

do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
#   IDC neighborhood start_date
# 1   1           13 2010-06-21
# 2   2           44 2014-12-24
# 3   3           22 2006-01-04
# 4   4            2 2011-06-15

dplyr

dat %>%
  group_by(IDC) %>%
  slice(which.max(start_date)) %>%
  ungroup()
# # A tibble: 4 x 3
#     IDC neighborhood start_date
#   <int>        <int> <date>    
# 1     1           13 2010-06-21
# 2     2           44 2014-12-24
# 3     3           22 2006-01-04
# 4     4            2 2011-06-15

Data

dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Hey thank you for the quick reply! Here's what happened. You're completely right about the date format not being imported into R. Because my import (using the package foreign) made the date look like this "13529462400", I switched the import function to one from haven. So finally it looks correct! CONT..(pt1) – Amanda Jan 05 '21 at 14:03
  • Regarding the pick last date function, since I copied in this code I keep getting this dplyr error (https://stackoverflow.com/questions/62842388/dplyr-warns-is-not-empty) which I've been trying to fix. If I just uninstall the package and restart, using only baseR script, it keeps writing "did you mispecify an argument?" for every line of code I have (in blue! not even an error) hahaha I closed R, closed my computer, opened it again. When I ran the baseR solution, there were still multiple IDCs (in the example above, would be multiple rows with IDC 1) and the dplyr error... – Amanda Jan 05 '21 at 14:45
  • Since you linked to that question (which I answered there), did you try the two recommendations? Frankly, it's a moot point since https://github.com/tidyverse/tibble/issues/798#issuecomment-661574331, where they announced that the fixed version of `tibble` is on CRAN. Try `install.packages(c("pillar","tibble"))`, then restart R, then try again. – r2evans Jan 05 '21 at 18:27
  • 1
    Sorry for the delay in replying. Even when I installed the package tibble, the error persisted. Every time I started this code. I asked my supervisor to run it to see if the same happened -- it didn't, everything worked. She saved a new version of the code that's identical and the workspace so now I finally have the dataset with the last date as I wanted. So that first code using only Base R works! Thank you so much for all the help! – Amanda Jan 14 '21 at 18:06