I'm relatively new to R, so I realise this type of question is asked often but I've read a lot of stack overflow posts and still can't quite get something to work on my data.
I have data on spss, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.
My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.
As it looks on spss :
IDC | neighborhood | start_date |
---|---|---|
1 | 22 | 08.07.2001 |
1 | 44 | 04.02.2005 |
1 | 13 | 21.06.2010 |
2 | 44 | 24.12.2014 |
2 | 3 | 06.03.2002 |
3 | 22 | 04.01.2006 |
4 | 13 | 08.07.2001 |
4 | 2 | 15.06.2011 |
In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...
Here are all my attempts so far to select the last date. All attempts ran, there was no error. The output just didn't give me what I want. To my perception, none of these made any change in the number of repetitions of IDC, so none of them actually selected *only the last date.
##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),] # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)
###### attempt 2--- not working
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(d1, IDC==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
##### atempt 3 -- doesn't work
do.call("rbind",
by(d1,INDICES = d1$IDC,
FUN=function(DF)
DF[which.max(DF$start_date),]))
#### attempt 4 -- doesnt work
library(plyr)
ddply(d1,.(IDC), function(X)
X[which.max(X$start_date),])
### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)
My goal dataset d1 would look like this:
IDC | neighborhood | start_date |
---|---|---|
1 | 13 | 21.06.2010 |
2 | 44 | 24.12.2014 |
3 | 22 | 04.01.2006 |
4 | 2 | 15.06.2011 |
I'm grateful for any help, many thanks <3