0

Let's say I have the following data frame:

df <- data.frame(address=c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd','1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'), city=c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'), state=c('GA','OR','OR','OR','OR','OR','PA','PA'), zip5=c('30308','97404','97404','97404','97201','97201','15223','15223'), zip9=c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717','97201-5000','15223-2105','15223-2105'), stringsAsFactors = FALSE)
`address              city    state zip5  zip9
1 654 Peachtree St    Atlanta    GA 30308 30308-1929
2    8910 River Rd     Eugene    OR 97404 97404-3253
3    8910 River Rd     Eugene    OR 97404 97404-3253
4    8910 River Rd     Eugene    OR 97404 97404-3253
5     1234 Main St   Portland    OR 97201 97201-5717
6     1234 Main St   Portland    OR 97201 97201-5000
7      567 1st Ave Pittsburgh    PA 15223 15223-2105
8      567 1st Ave       Etna    PA 15223 15223-2105`

I'm considering any rows with a matching address and zip5 to be duplicates.

Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:

`address        city    state    zip5  zip9      type 
1 8910 River Rd Eugene     OR    97404 97404-3253 Exact Match 
2 8910 River Rd Eugene     OR    97404 97404-3253 Exact Match 
3 8910 River Rd Eugene     OR    97404 97404-3253 Exact Match 
4  1234 Main St Portland   OR    97201 97201-5717 Different Zip9 
5  1234 Main St Portland   OR    97201 97201-5000 Different Zip9 
6  567 1st Ave  Pittsburgh PA    15223 15223-2105 Different City 
7  567 1st Ave  Etna       PA    15223 15223-2105 Different City`

(I'd also be fine with a True/False column for each type of duplicate.)

I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...

Any advice?

Edit: I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.

yaxx0r
  • 149
  • 3
  • 15
  • I disagree about the duplicates. The point here, as the edit says, is identifying the columns where differences exist, not getting rid of duplicates. – GenesRus Oct 04 '19 at 23:34
  • Yes, you are right. After your edit, looks like different to me as well. – gavg712 Oct 05 '19 at 01:20

1 Answers1

0

I think the key is grouping by "reference" variable--here address makes sense--and then you can count the number of unique items in that vector. It's not a perfect solution since my use of case_when will prioritize earlier options (i.e. if there are two different cities attributed to one address AND two different zip codes, you'll only see that there are two different cities--you will need to address this if it matters with additional case_when statements). However, getting the length of unique items is a reasonable heuristic in this case if you don't need a perfectly granular solution.

df %>% 
  group_by(address) %>% 
  mutate( 
    match_type = case_when(
      all(
        length(unique(city))  == 1,
        length(unique(state)) == 1,
        length(unique(zip5))  == 1,
        length(unique(zip9))  == 1) ~ "Exact Match",
      length(unique(city)) > 1      ~ "Different City",
      length(unique(state)) > 1     ~ "Different State",
      length(unique(zip5)) > 1      ~ "Different Zip5",
      length(unique(zip9)) > 1      ~ "Different Zip9"
    ))

Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.

Edit

One additional approach I just thought of if you need a more granular solution is to utilize the addition of an id column (df %>% rowid_to_column("ID")) and then a full join of the table to itself by address with suffixes (e.g. suffix = c("a","b")), filtering out same IDs and calling distinct (since each comparison is there twice), and then you can make Boolean columns with mutate for the pairwise comparisons. It may be too computationally intensive, depending on the size of your dataset, but it should work on the scale of a few thousand if you have a reasonable amount of RAM.

Community
  • 1
  • 1
GenesRus
  • 1,057
  • 6
  • 16