In R, looking to remove duplicates from certain rows, and combine rows on others

Question

I have a dataframe that has 5 columns. I want to remove duplicates based on "OPP_ID" column but want to merge the records for the last two columns "Sales" and "Marketing". The last two columns have NAs as well. I've tried several approaches but I have not gotten the desired effect.

Here is the initial Table

|  Name       |  Company   |  Opp_id  |  Sales  | Marketing
|  John S.    |  Amazon    |  12354   |  Yes    |  NA 
|  Bill W.    |  Google    |  15566   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
|  Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  NA     |  Yes
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA
|  Tom S.     |  Zillow    |  87423   |  Yes    |  NA

Here is the desired result table:

|  Name      |  Company    |  Opp_ID   |  Sales |  Marketing
|  John S.   |  Amazon     |  12354    |  Yes   |  NA
|  Bill W.   |  Google     |  15566    |  NA    |  Yes
|  Darryl W. |  Facebook   |  98456    |  Yes   |  Yes
|  Tom S.    |  Zillow     |  87423    |  Yes   |  NA

DPH · Answer 1 · 2021-03-17T10:35:01.707

If I understand your problem correctly this is a possible solution with dplyr:

# reading in the data you supplied ( I removed the leading | )
library(data.table)
df <- data.table::fread(" Name       |  Company   |  Opp_id  |  Sales  | Marketing
John S.    |  Amazon    |  12354   |  Yes    |  NA 
Bill W.    |  Google    |  15566   |  NA     |  Yes
Darryl W.  |  Facebook  |  98456   |  NA     |  Yes
Darryl W.  |  Facebook  |  98456   |  Yes    |  NA
Tom S.     |  Zillow    |  87423   |  NA     |  Yes
Tom S.     |  Zillow    |  87423   |  Yes    |  NA
Tom S.     |  Zillow    |  87423   |  Yes    |  NA")

# calculations
library(dplyr)
df %>% 
  # If value is na then convert to FALSE else to TRUE for two columns as the same function is needed
  dplyr::mutate(across(c(Sales, Marketing), ~ifelse(is.na(.x), FALSE, TRUE))) %>% 
  # Build the grouping (I am supposing they are 100% matching, else keep only Opp_id)
  dplyr::group_by(Name, Company, Opp_id) %>% 
  # In summarise comprise the grouping to unique combinations of names variables and any delivers TRUE if at least one TRUE is found in the group
  dplyr::summarise(across(c(Sales, Marketing), ~ any(.x))) %>%
  # always safer to remove the grouping unless you need it specificaly
  dplyr::ungroup() 

 # be aware that the output war reordered
  Name      Company  Opp_id Sales Marketing
  <chr>     <chr>     <int> <lgl> <lgl>    
1 Bill W.   Google    15566 FALSE TRUE     
2 Darryl W. Facebook  98456 TRUE  TRUE     
3 John S.   Amazon    12354 TRUE  FALSE    
4 Tom S.    Zillow    87423 TRUE  TRUE

Thanks, although I'm getting a 'lifecycle' package error when I run the code, I've loaded the package/library, still getting this error for some reason. Any thoughts? Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘lifecycle’ — indy anahh, Mar 17 '21 at 07:43
my best guess would be to re-install the tidyverse and/or dplyr package as well as the lifecycle package (maybe it already comes with the tidyverse) — DPH, Mar 17 '21 at 10:31

akrun · Answer 2 · 2021-03-17T02:16:15.570

We can group by the 'Name', 'Company', 'Opp_id', and summarise across the columns 'SALES', 'MARKETING' by selecting the first element after ordering by a logical vector

library(dplyr)
df1 %>%
    group_by(Name, Company, Opp_id) %>% 
    summarise(across(c(Sales, Marketing),
       ~ first(.[order(is.na(.))])), .groups = 'drop')

-output

# A tibble: 4 x 5
#  Name      Company  Opp_id Sales Marketing
#* <chr>     <chr>     <int> <chr> <chr>    
#1 Bill W.   Google    15566 <NA>  Yes      
#2 Darryl W. Facebook  98456 Yes   Yes      
#3 John S.   Amazon    12354 Yes   <NA>     
#4 Tom S.    Zillow    87423 Yes   Yes

data

df1 <- structure(list(Name = c("John S.", "Bill W.", "Darryl W.", "Darryl W.", 
"Tom S.", "Tom S.", "Tom S."), Company = c("Amazon", "Google", 
"Facebook", "Facebook", "Zillow", "Zillow", "Zillow"), Opp_id = c(12354L, 
15566L, 98456L, 98456L, 87423L, 87423L, 87423L), Sales = c("Yes", 
NA, NA, "Yes", NA, "Yes", "Yes"), Marketing = c(NA, "Yes", "Yes", 
NA, "Yes", NA, NA)), class = "data.frame", row.names = c(NA, 
-7L))

In R, looking to remove duplicates from certain rows, and combine rows on others

2 Answers2

data