I have a data set with some ambiguous end dates. Since I cannot decide which one is correct, I would like to remove them from the data frame, but I cannot figure out how.
Here is a sample df:
ID = as.integer(c(1,1,2,2,2,3,3,4,5,5,6,6))
Feature = c("A","A","A","A","A","A","B","B","B","B","B","C")
From = as.Date(c("2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2016-01-01","2015-01-01","2015-01-01"))
To = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01", "2017-01-01", "2016-01-01", "2017-01-01", "2016-01-01","2016-01-01","2017-01-01","2016-01-01","2016-01-01"))
df = data.frame(ID, Feature, From, To)
#which looks like this:
   ID Feature       From         To
1   1       A 2015-01-01 2016-01-01
2   1       A 2015-01-01       <NA>
3   2       A 2015-01-01 2015-01-01
4   2       A 2015-01-01 2016-01-01
5   2       A 2015-01-01 2017-01-01
6   3       A 2015-01-01 2016-01-01
7   3       B 2015-01-01 2017-01-01
8   4       B 2015-01-01 2016-01-01
9   5       B 2015-01-01 2016-01-01
10  5       B 2016-01-01 2017-01-01
11  6       B 2015-01-01 2016-01-01
12  6       C 2015-01-01 2016-01-01
I would like to remove all ambiguous cases, that is, rows that are duplicated on every variable except the last one, To (IDs 1 and 2 are such cases). Any other variation or duplication in the data set is tolerated.
EDIT: Perhaps I should specify that the Feature variable denotes a certain disadvantage on the labour market (such as being disabled, a lone parent, a young graduate with no work experience, etc.). One person can therefore have multiple disadvantages, and these can occur at multiple times. I edited the original sample df to allow for such variation.
My ideal sample df would keep these cases:
   ID Feature       From         To
6   3       A 2015-01-01 2016-01-01
7   3       B 2015-01-01 2017-01-01
8   4       B 2015-01-01 2016-01-01
9   5       B 2015-01-01 2016-01-01
10  5       B 2016-01-01 2017-01-01
11  6       B 2015-01-01 2016-01-01
12  6       C 2015-01-01 2016-01-01
I have looked at other SO questions on the duplicated and distinct functions, but could not find a similar post. I think my problem is different from the one described in this post, because I do not care about the number of cases (features) retained in my data set, as long as their dates do not contradict each other. By contradiction I mean that a feature was identified twice for the same ID with the same start date but different end dates. In those cases I do not know which row to select, so I prefer to remove them entirely.
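To make that definition concrete, here is a minimal sketch that flags the contradictory rows by counting distinct end dates within each ID/Feature/From group. The names `key` and `n_ends` are only illustrative, and the sample df is recreated so the snippet runs on its own. It assumes that an NA end date should count as a value of its own, which is what makes ID 1 contradictory.

```r
# Recreate the sample df from above so the snippet is self-contained
df <- data.frame(
  ID      = as.integer(c(1,1,2,2,2,3,3,4,5,5,6,6)),
  Feature = c("A","A","A","A","A","A","B","B","B","B","B","C"),
  From    = as.Date(c(rep("2015-01-01", 9), "2016-01-01", rep("2015-01-01", 2))),
  To      = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01", "2017-01-01",
                      "2016-01-01", "2017-01-01", "2016-01-01", "2016-01-01",
                      "2017-01-01", "2016-01-01", "2016-01-01"))
)

# One key per ID/Feature/From combination (the group that must agree on To)
key <- paste(df$ID, df$Feature, df$From)

# Number of distinct end dates within each group; NA counts as its own value
n_ends <- ave(as.numeric(df$To), key, FUN = function(x) length(unique(x)))

# Rows whose group carries more than one end date are the contradictory ones
df[n_ends > 1, ]
```

On the sample data this should pick out the rows belonging to IDs 1 and 2, which are exactly the ones I want to drop.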
I have also been playing around with these functions, e.g. like this:
select = !duplicated(df[,1:3])
df[select,]
but I cannot find a way to remove both members of a duplicated pair, and not just the second one. Thank you in advance for any tips!
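One direction that might work, as a sketch rather than a confirmed solution: `duplicated()` accepts a `fromLast` argument, so scanning the key columns from both ends and combining the results should flag every member of a duplicated group, not only the later ones. The names `key`, `ambiguous` and `clean` are only illustrative, and the sample df is recreated so the snippet runs on its own.

```r
# Recreate the sample df from above so the snippet is self-contained
df <- data.frame(
  ID      = as.integer(c(1,1,2,2,2,3,3,4,5,5,6,6)),
  Feature = c("A","A","A","A","A","A","B","B","B","B","B","C"),
  From    = as.Date(c(rep("2015-01-01", 9), "2016-01-01", rep("2015-01-01", 2))),
  To      = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01", "2017-01-01",
                      "2016-01-01", "2017-01-01", "2016-01-01", "2016-01-01",
                      "2017-01-01", "2016-01-01", "2016-01-01"))
)

# Key columns that define a case: ID, Feature, From
key <- df[, 1:3]

# duplicated() alone misses the first member of each group;
# scanning again from the end catches it
ambiguous <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep only rows whose key occurs exactly once
clean <- df[!ambiguous, ]
clean
```

Note that this would also drop rows that are exact duplicates including To, which are not contradictory in the sense described above; if such rows can occur, they may need to be collapsed first (for example with `unique(df)`).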