
I have a data set with some ambiguous end dates. Since I cannot decide which one is correct, I would like to remove them from the data frame, but cannot figure out how.

Here is a sample df:

ID = as.integer(c(1,1,2,2,2,3,3,4,5,5,6,6))
Feature = c("A","A","A","A","A","A","B","B","B","B","B","C")
From = as.Date(c("2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2015-01-01","2016-01-01","2015-01-01","2015-01-01"))
To = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01", "2017-01-01", "2016-01-01", "2017-01-01", "2016-01-01","2016-01-01","2017-01-01","2016-01-01","2016-01-01"))
df = data.frame(ID, Feature, From, To)


#which looks like this:

   ID Feature       From         To
1   1       A 2015-01-01 2016-01-01
2   1       A 2015-01-01       <NA>
3   2       A 2015-01-01 2015-01-01
4   2       A 2015-01-01 2016-01-01
5   2       A 2015-01-01 2017-01-01
6   3       A 2015-01-01 2016-01-01
7   3       B 2015-01-01 2017-01-01
8   4       B 2015-01-01 2016-01-01
9   5       B 2015-01-01 2016-01-01
10  5       B 2016-01-01 2017-01-01
11  6       B 2015-01-01 2016-01-01
12  6       C 2015-01-01 2016-01-01

I would like to remove all ambiguous cases that are duplicated on every variable except the last one (IDs 1 and 2 are such cases). Any other variance or duplicity is tolerated in the data set.

EDIT: Perhaps I should specify that the Feature variable means a certain disadvantage on the labour market (such as being disabled, a lone parent, or a young graduate with no work experience). So one person can have multiple disadvantages, and these can occur multiple times. I edited the original sample df to allow for such variance.

My ideal sample df would keep these cases:

   ID Feature       From         To
6   3       A 2015-01-01 2016-01-01
7   3       B 2015-01-01 2017-01-01
8   4       B 2015-01-01 2016-01-01
9   5       B 2015-01-01 2016-01-01
10  5       B 2016-01-01 2017-01-01
11  6       B 2015-01-01 2016-01-01
12  6       C 2015-01-01 2016-01-01

I have been looking at other SO questions on the duplicated and distinct functions, but could not find a similar post. I think my problem is different from the one described in this post, because I do not care about the number of cases (features) retained in my data set, as long as their dates do not contradict each other. By contradiction I mean that a feature was identified twice with the same starting date but different end dates. In those cases, I do not know which one to select, so I prefer to remove them entirely.

I have also been playing around with these functions, e.g. like this:

select = !duplicated(df[,1:3])
df[select,]

but cannot find a way to remove both rows of a duplicated case, not just the second one. Thank you in advance for any tips!
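A dplyr sketch of this idea (hedged: the dplyr approach and the `df_clean` name are my own, not from the original post) is to group on every column except To and keep only groups of size one, which drops all rows of each ambiguous case at once:

```r
library(dplyr)

# sample data from the question
df <- data.frame(
  ID      = as.integer(c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6)),
  Feature = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C"),
  From    = as.Date(c(rep("2015-01-01", 9), "2016-01-01",
                      "2015-01-01", "2015-01-01")),
  To      = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01",
                      "2017-01-01", "2016-01-01", "2017-01-01", "2016-01-01",
                      "2016-01-01", "2017-01-01", "2016-01-01", "2016-01-01"))
)

# keep only (ID, Feature, From) combinations that occur exactly once,
# so every row of an ambiguous case is removed together
df_clean <- df %>%
  group_by(ID, Feature, From) %>%
  filter(n() == 1) %>%
  ungroup()
```

On this sample, `df_clean` contains the seven rows for IDs 3 to 6, matching the ideal output above.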

malasi
  • If you want to remove duplicates and all elements which have duplicates, one solution could be to create a `filter_cases` `data.frame` containing the duplicates. Then, you use this `filter_cases` to keep only the elements in your `original` `data.frame` that are **not** in the `filter_cases` `data.frame` – Seymour Apr 10 '18 at 09:57
  • With `data.table` you can do `DT[!DT[duplicated(DT)], on=names(DT)]` or https://stackoverflow.com/a/49408559/5414452 – jogo Apr 10 '18 at 10:03
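jogo's data.table one-liner removes exact full-row duplicates; a small variant keyed on the first three columns (my adaptation, assuming the `data.table` package — not from the original comment) also catches rows that differ only in To:

```r
library(data.table)

# sample data from the question, as a data.table
DT <- data.table(
  ID      = as.integer(c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6)),
  Feature = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C"),
  From    = as.Date(c(rep("2015-01-01", 9), "2016-01-01",
                      "2015-01-01", "2015-01-01")),
  To      = as.Date(c("2016-01-01", NA, "2015-01-01", "2016-01-01",
                      "2017-01-01", "2016-01-01", "2017-01-01", "2016-01-01",
                      "2016-01-01", "2017-01-01", "2016-01-01", "2016-01-01"))
)

# keep only ID/Feature/From groups that contain a single row,
# dropping every member of an ambiguous group
unambiguous <- DT[, if (.N == 1L) .SD, by = .(ID, Feature, From)]
```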

2 Answers


One way to remove all instances of duplicated rows is to run duplicated a second time over the data in reverse order, because duplicated

determines which elements of a vector or data frame are duplicates of elements with smaller subscripts

i.e. it never flags the first occurrence. Combining the forward and reverse passes therefore removes every row that contains duplicated data.

# first pass
s1 = !duplicated(df[,1:3])
# second pass on the data.frame with reversed order in each column
s2 = !duplicated(apply(df[,1:3], 2, rev))
# the second pass needs to be back-reversed to match the original df
df[s1 & rev(s2), ]
   ID Feature       From         To
6   3       A 2015-01-01 2016-01-01
7   3       B 2015-01-01 2017-01-01
8   4       B 2015-01-01 2016-01-01
9   5       B 2015-01-01 2016-01-01
10  5       B 2016-01-01 2017-01-01
11  6       B 2015-01-01 2016-01-01
12  6       C 2015-01-01 2016-01-01

Or we can use the more elegant solution that @dalloliogm pointed out: call duplicated with the argument fromLast = TRUE.

s2 = !duplicated(df[,1:3], fromLast = TRUE)
df[s1 & s2, ]
nya
    you can use the fromLast option in duplicated instead of using apply and rev. – dalloliogm Apr 10 '18 at 10:22
  • the `df [s1 & s2,]` suggestion is elegant and works wonderfully even if a case has more than one ambiguous duplicate. Thank you both answerers for pointing out this creative use of `fromLast` argument. – malasi Apr 10 '18 at 14:35

I would like to remove all ambiguous cases that are duplicated on each variable, except for the last one (ID 1 and 2), but retain any other type of variance, e.g. if the ID has more than one feature (ID 3).

It would be a bit easier to answer if you could provide a longer example; some of the edge cases are not clear to me.

First, identify which rows have more "variability", e.g. more than one feature per ID:

> library(tidyverse)
> df %>% group_by(ID) %>% mutate(n_features = n_distinct(Feature))
# A tibble: 12 x 5
# Groups:   ID [6]
      ID Feature       From         To n_features
   <int>  <fctr>     <date>     <date>      <int>
 1     1       A 2015-01-01 2016-01-01          1
 2     1       A 2015-01-01         NA          1
 3     2       A 2015-01-01 2015-01-01          1
 4     2       A 2015-01-01 2016-01-01          1
 5     2       A 2015-01-01 2017-01-01          1
 6     3       A 2015-01-01 2016-01-01          2
 7     3       B 2015-01-01 2017-01-01          2
 8     4       B 2015-01-01 2016-01-01          1
 9     5       B 2015-01-01 2016-01-01          1
10     5       B 2016-01-01 2017-01-01          1
11     6       B 2015-01-01 2016-01-01          2
12     6       C 2015-01-01 2016-01-01          2

Second, remove all the duplicated rows, except those with "variability":

> df %>% 
       group_by(ID) %>% 
       mutate(n_features = n_distinct(Feature)) %>% 
       ungroup %>% 
       filter(
               # keep rows whose (ID, Feature, From) combination is unique in
               # both scan directions, or rows from IDs with several features
               (!duplicated(data.frame(ID, Feature, From)) &
                !duplicated(data.frame(ID, Feature, From), fromLast = TRUE)) |
                 n_features > 1)
# A tibble: 7 x 5
     ID Feature       From         To n_features
  <int>  <fctr>     <date>     <date>      <int>
1     3       A 2015-01-01 2016-01-01          2
2     3       B 2015-01-01 2017-01-01          2
3     4       B 2015-01-01 2016-01-01          1
4     5       B 2015-01-01 2016-01-01          1
5     5       B 2016-01-01 2017-01-01          1
6     6       B 2015-01-01 2016-01-01          2
7     6       C 2015-01-01 2016-01-01          2
dalloliogm