How do I de-duplicate while merging this data in R?

Question

Goal: Merge two excel files that will have significant overlap, but overwrite ONLY the phone numbers and record ID of one data set.

What I have been doing: Just brute force de-dup in excel where I copy over the sheet with phone numbers, sort the ID column, identify/highlight duplicates, and drag "up" the phone numbers to fill in the empty space for the matching ID. The process isn't hard, but with more records it starts to get absurdly tedious. These data look like this in the merged but not de-duplicated Excel file. Or just in plain text:

555555 |    Joe  |  Copy    |   DOB  |AGE | 555 Data Road | DataVille | LA |ZIP|County|(**PHONE GOES HERE**) |Male|White|Doc Name|More info

555555| Joe| Copy |DOB| AGE|    555 Data Road|DataVille|LA| ZIP|County| 555555555 (Phone)

And the phone should be added to the space between County and Gender for every record that matches the two ID's (first number in the record).

Attempts in R:

df_final <- merge(df_noPhone, df_Phone, by = c("Record_ID"), all.x = T)

But this just duplicates the columns ("PatientAddress.x" etc.) and I need those to be synced up for the records to be complete.

The real tricky part is, though that it isn't consistent this way throughout the data. Sometimes we simply don't have phone numbers for certain records and we still want to retain them in the data.

Suggestions? I've tried merging with almost every package I can imagine but sometimes it ends up creating more work in the direct, raw data file afterwards than it's worth.

Thanks!

Please provide sample data using `dput(...)` or `data.frame(...)`. — r2evans, Jun 10 '20 at 15:58
In particular: it may be helpful if you could add sample data (with `dput()`, as suggested by r2evans) showing the corner cases you're worried about. For instance, can you illustrate the sentence, "Sometimes we simply don't have phone numbers for certain records and we still want to retain them in the data," with a reproducible example? Similarly, for the (non-reproducible) example you did post, could you post the desired output instead of just describing it? — Aaron Montgomery, Jun 10 '20 at 16:19

score 0 · Answer 1 · answered Jun 10 '20 at 19:03

You'd mentioned :

.. identify/highlight duplicates, and drag "up" the phone numbers to fill in the empty space for the matching ID.

I suggest : replace the "drag up", with a formula.. then swap the column.

Assuming your data is filled in A2:S3, put :

=IF(M2="",1,0) in U2
=IF(U2=1,INDEX(M:M,MATCH(1,INDEX((0=U:U)*(A2=A:A),0,1),0)),"No data") in V2

and drag both downwards.

ref link : https://exceljet.net/formula/index-and-match-with-multiple-criteria

You'll noticed that in I use "no data" to 'fill' in column that already have numbers. you may use data > filter to remove/unselect those lines manually. (That's what I'll do.. but still it's = up to you.. )

Hope it helps..

How do I de-duplicate while merging this data in R?

1 Answers1