0

I'm trying to automate my data cleaning process. My dataset looks like this:

ADDRESS PHONE TYPE 123 Willow Street 7429947 RESIDENTIAL 123 Willow Street 7426629 RESIDENTIAL 234 Butter Road 7564123 RESIDENTIAL It's quite large - several hundred thousand rows. I'd like to be able to do the following thing:

(1) Duplicate Detection, so I can eliminate the "nearly"-duplicate rows.
(2) Create a new column for the non-duplicated data - something like PHONE 2. The issue is that I cannot know beforehand whether or not there are only 2 duplicate rows - could be n.

The outcome would hopefully be something like this:

ADDRESS PHONE PHONE 2 TYPE 123 Willow Street 7429947 7426629 RESIDENTIAL 234 Butter Road 7564123 RESIDENTIAL

I'd love to do this with dplyr, but I'm sort of at a loss as to where to start. Any pointers?

Kyle Shank
  • 77
  • 1
  • 10
  • What is the expected output in the above case? Which row do you keep? What makes something a prospective duplicate? – Gopala Feb 10 '17 at 18:31
  • @Gopala - Thanks - I edited to include an expected output. – Kyle Shank Feb 10 '17 at 18:33
  • Is phone the only column to handle? What if the address is slightly different? Your question / requirement is still not clear. – Gopala Feb 10 '17 at 18:35
  • @Gopala - The only value that may be different is in the PHONE column. Address/Type will always be consistent - just duplicated rows may exist because of the different value for PHONE. – Kyle Shank Feb 10 '17 at 18:36
  • This looks like a reshape wide to me. Take a look at [this post](http://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) for other options. – lmo Feb 10 '17 at 19:09

1 Answers1

2

Using a combination of dplyr and tidyr, you can do something like this:

library(dplyr)
library(tidyr)
df <- data.frame(ADDRESS = c('123 Willow Street', '123 Willow Street',
                             '234 Butter Road'),
                 PHONE = c('7429947', '7426629', '7564123'),
                 TYPE = 'RESIDENTIAL', stringsAsFactors = FALSE)
df %>%
  group_by(ADDRESS, TYPE) %>%
  mutate(PHONEID = paste0('PHONE', row_number())) %>%
  spread(PHONEID, PHONE)

Output will be as follows:

Source: local data frame [2 x 4]
Groups: ADDRESS, TYPE [2]

            ADDRESS        TYPE  PHONE1  PHONE2
*             <chr>       <chr>   <chr>   <chr>
1 123 Willow Street RESIDENTIAL 7429947 7426629
2   234 Butter Road RESIDENTIAL 7564123    <NA>
Gopala
  • 10,363
  • 7
  • 45
  • 77