I'm trying to automate my data cleaning process. My dataset looks like this:
ADDRESS PHONE TYPE
123 Willow Street 7429947 RESIDENTIAL
123 Willow Street 7426629 RESIDENTIAL
234 Butter Road 7564123 RESIDENTIAL
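For reference, here's a minimal reproducible version of the data (column types are my guess - I'm reading PHONE as character since these aren't numbers I'd do arithmetic on):

```r
library(dplyr)

df <- tibble(
  ADDRESS = c("123 Willow Street", "123 Willow Street", "234 Butter Road"),
  PHONE   = c("7429947", "7426629", "7564123"),
  TYPE    = c("RESIDENTIAL", "RESIDENTIAL", "RESIDENTIAL")
)
```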
It's quite large (several hundred thousand rows). I'd like to do the following:
(1) Detect duplicates so I can collapse the "nearly"-duplicate rows (same ADDRESS and TYPE, different PHONE).
(2) Move the non-duplicated values into new columns - something like PHONE 2. The catch is that I can't know beforehand that a group has only 2 duplicate rows - it could be n.
The outcome would hopefully be something like this:
ADDRESS PHONE PHONE 2 TYPE
123 Willow Street 7429947 7426629 RESIDENTIAL
234 Butter Road 7564123 RESIDENTIAL
I'd love to do this with dplyr, but I'm sort of at a loss as to where to start. Any pointers?
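My rough idea so far - and I'm not sure this is the right approach - was to number the phones within each ADDRESS/TYPE group and then spread them into columns, assuming tidyr's `pivot_wider` is fair game alongside dplyr:

```r
library(dplyr)
library(tidyr)

# Number each phone within its address group, then
# spread the numbered phones into PHONE 1, PHONE 2, ... columns
df %>%
  group_by(ADDRESS, TYPE) %>%
  mutate(slot = row_number()) %>%
  pivot_wider(names_from = slot,
              values_from = PHONE,
              names_prefix = "PHONE ")
```

But I don't know whether this scales to several hundred thousand rows, or whether there's a more idiomatic dplyr-only way to handle the unknown number of duplicates per group.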