1

I have a dataframe much like this one

ID     Col2  Col3  Col4  Col5 
ID_1   JDH   43    FJS   f8j
ID_8   ABC   73    *NA*  *NA*
ID_7   AFE   03    *NA*  k32
ID_8  *NA*   *NA*   FDS   z7d

where I have some rows with the same name/ID (ID_8 in this example) that have different values in the following columns. I now want to merge these rows into a new row ID_8_New without NAs. The dataframe should look something like this:

ID         Col2  Col3  Col4  Col5 
ID_1       JDH   43    FJS   f8j
ID_8       ABC   73    *NA*  *NA*
ID_7       AFE   03    *NA*  k32
ID_8      *NA*   *NA*   FDS   z7d
ID_8_New   ABC   73    FDS   z7d

Does anyone have any tips for me how I can accomplish this? My real dataframe has multiple of those rows. Thank you!

EDIT: Here is the dputstructure of a subset of my data

structure(list(Company = c("CompanyA", "CompanyA"), Ticker = c("A", 
"A"), Ticker_Unq = c("A1", 
"A2"), Description = c(NA, 
"Text"
), Type = c("Token", NA)), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"))
Soph2010
  • 563
  • 3
  • 13
  • You showed the expected output differently which I showed in my post and the accepted one doesn't have the same as in your expected – akrun Oct 19 '22 at 15:33
  • Sorry about that @akrun but your code did not work for me. I do not necessarily need the old ID_8 rows which is why I accepted the other answer. – Soph2010 Oct 19 '22 at 16:13
  • 1
    All good akrun, I will accept your answer. I am not sure why it is not working for me but I will accept it anyways. – Soph2010 Oct 19 '22 at 16:16

2 Answers2

3

We could filter

library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
   group_by(ID) %>%
    filter(n() > 1) %>%
    summarise(across(everything(), ~ .x[order(is.na(.x))]), .groups = 'drop')%>%
    drop_na() %>%
    mutate(ID = str_c(ID, '_new')) %>%
    bind_rows(df1, .)

-output

       ID Col2 Col3 Col4 Col5
1     ID_1  JDH   43  FJS  f8j
2     ID_8  ABC   73 <NA> <NA>
3     ID_7  AFE    3 <NA>  k32
4     ID_8 <NA>   NA  FDS  z7d
5 ID_8_new  ABC   73  FDS  z7d

-output

df1 <- structure(list(ID = c("ID_1", "ID_8", "ID_7", "ID_8"),
 Col2 = c("JDH", 
"ABC", "AFE", NA), Col3 = c(43L, 73L, 3L, NA), Col4 = c("FJS", 
NA, NA, "FDS"), Col5 = c("f8j", NA, "k32", "z7d")),
 class = "data.frame", row.names = c(NA, 
-4L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you for the input. Unfortunately, when I use my real dataframe and replace the respective values in the code the output is not correct. ID_8_new does not show up. – Soph2010 Oct 19 '22 at 16:13
  • @Soph2010 I am not sure about the reason. Here, I showed a reproducible example with `dput` structure so that anybody can test it. Your input structure may be slighly different – akrun Oct 19 '22 at 16:16
  • @Soph2010 it is better to use `dput` for showing data as this gives the structure correctly and there won't be any surprises. I am thinking that you may have some leading/lagging spaces in the ID column which makes the `"ID_8"` and `" ID_8"` different entries – akrun Oct 19 '22 at 16:18
  • would have used ```dput``` however my datafram is huge (over 200 columns) so not sure how helpful that would have been to read. There are no leading/lagged spaces as far as I can see. – Soph2010 Oct 19 '22 at 16:36
  • @Soph2010 I would use a subset of data and dput i.e. `dput(head(iris, 20))` if the columns are large as well `dput(yourdata[1:20, 1:5])` etc – akrun Oct 19 '22 at 16:37
  • Thank you, I edited my question and added a subset of the data - does that help? – Soph2010 Oct 20 '22 at 06:22
  • @Soph2010 In the new example, you have two rows and I assume `Company` is the 'ID'? Also, both rows are unique `"Alphr finance\r\n"` and `"Alphr"`. This was not the case in your earlier example – akrun Oct 20 '22 at 13:46
  • I edited the ```dput``` - you are right, Company is the ID. In my real dataset I have another column that has the Company with the correct writing in it, there both Companies are named the same – Soph2010 Oct 21 '22 at 08:21
  • @Soph2010 In the new data, you have `Ticker_Unq` having values 'A1' and 'A2'. If I use the same code here, it gives the output with first 'A1' row as the other gets removed with `drop_na` after reordering the NAs in last. Here, I am not clear whether youneed 'A2' as well or not? If so, we have to check for whether there are any non-NA elements in the row for filtering instead of drop_na – akrun Oct 21 '22 at 20:28
  • @Soph2010 i.e. either `df2 %>% group_by(Company) %>% filter(n() > 1) %>% summarise(across(everything(), ~ toString(unique(.x[!is.na(.x)]))), .groups = 'drop') %>% filter(if_any(-Company, complete.cases)) %>% mutate(Company = str_c(Company, '_new')) %>% bind_rows(df2, .)` or `df2 %>% group_by(Company) %>% filter(n() > 1) %>% summarise(across(everything(), ~ toString(unique(.x[!is.na(.x)]))), .groups = 'drop') %>% filter(if_any(-Company, complete.cases)) %>% mutate(Company = str_c(Company, '_new')) %>% bind_rows(df2, .)` (combine the non-Na to a single row string) – akrun Oct 21 '22 at 20:31
  • Thank you, however I seem to have some columns that can't be combined as I get the error ```Error in `bind_rows()`: ! Can't combine `..1$Date_Price_Crawled` > and `..2$Date_Price_Crawled` .``` and ```Error in bind_rows(df2, .) :``` (some of the columns I did not include in the dput as my dataset is too big. – Soph2010 Oct 24 '22 at 08:41
  • @Soph2010 Probably we can convert the columns to `character` and then bind i.e. `df2 %>% group_by(Company) %>% filter(n() > 1) %>% summarise(across(everything(), ~ toString(unique(.x[!is.na(.x)]))), .groups = 'drop') %>% filter(if_any(-Company, complete.cases)) %>% mutate(Company = str_c(Company, '_new')) %>% bind_rows(df2, .) or df2 %>% group_by(Company) %>% filter(n() > 1) %>% summarise(across(everything(), ~ toString(unique(.x[!is.na(.x)]))), .groups = 'drop') %>% filter(if_any(-Company, complete.cases)) %>% mutate(Company = str_c(Company, '_new')) %>%` contd – akrun Oct 24 '22 at 15:25
  • @Soph2010 `%>%mutate(across(everything(), as.character)) %>% bind_rows(df2 %>% mutate(across(everything(), as.character)), .) %>% type.convert(as.is = TRUE)` – akrun Oct 24 '22 at 15:26
3

You can return the maximum value per group:

library(dplyr)
dat %>% 
  group_by(ID) %>% 
  summarise(across(everything(), ~ max(.x, na.rm = T)))

output

# A tibble: 3 × 5
  ID    Col2  Col3  Col4  Col5 
  <chr> <chr> <chr> <chr> <chr>
1 ID_1  JDH   43    FJS   f8j  
2 ID_7  AFE   03    NA    k32  
3 ID_8  ABC   73    FDS   z7d 
Maël
  • 45,206
  • 3
  • 29
  • 67
  • Works great, thank you very much! One follow up questions: If I have values in both columns of the respective rows I want to merge, can I somehow control which value should be replaced/taken over? – Soph2010 Oct 19 '22 at 09:40
  • 1
    In this case, it's going to be the first one. It depends on what you want. It might be better to create a new question for this, since the answer might not be that straightforward. – Maël Oct 19 '22 at 09:42