lookup for values in two columns in the presence of duplicates in R (edited)

Question

As a newbie I'm struggling to solve the following issue with a big dataset. I would want to look for the value in the "CHROM" column comparing two columns of CLONEID (the second column contains duplicates). The corresponding "CHROM" value would be attributed to each "CLONEID" and same value would be attributed to its duplicate. I write this table as an example:

CLONEID   | CHROM
976803    | 2A
976877    | 5B
976952    | 6B
976961    | 3B
976975    | 1A
977084    | 7B
977228    | 4A
977241    | 3A

I would like to obtain the following output:

CLONEID   | CHROM
976803    |2A
976877    |5B
976952    |6B
976961    |3B
976975    |1A
977084    |7B
977084_1  |7B
977228    |4A
977228_1  |4A
977228_2  |4A
977228_3  |4A
977241    |3A

How do you decide which row need to be duplicated and how many times? — Ronak Shah, Mar 18 '20 at 01:09
It's difficult to understand what you are expecting because the first table does not contain most entries that appear in the second one, so we can't make a correlation and establish what the rules are. Maybe if you made a longer example. Also, are CLONEDID's being renamed with a trailing sequencing number when they have same CHROM? — user2332849, Mar 18 '20 at 01:24
@user2332849 Yes, I agree. Sorry, I did not include the first entry. I made the correction now. Yes, CLONEID's are renamed in a sequencing number when they pursue the same CHROM. — Marilou, Mar 18 '20 at 07:54
@RonakShah I know already the specific rows that should be duplicated. So, a code that could function like VLOOKUP which detects for example 977084 and attributes the CHROM to 977084 and 977084_... Basically, 977084=977084_1=977084_2. — Marilou, Mar 18 '20 at 08:02
Where is the information which tells `977084`. needs to be repeated 2 times and 977228 4 times. Why only 2 and 4 times and not 1, 3 or 10 times ? — Ronak Shah, Mar 18 '20 at 09:52
@RonakShah The column CLONEID of my output is already defined. The only thing that is needed is the correspondent CHROM. — Marilou, Mar 18 '20 at 12:15

score 0 · Answer 1 · answered Mar 18 '20 at 08:15

Okay I have made a hypothetical example with the partial data you provided.

library(dplyr)

df <- tribble(
  ~CLONEID, ~CHROM,
  '976803', '2A',
  '976877', '5B',
  '976952', '6B',
  '976961', '3B',
  '976975', '1A',
  '977084', '7B',
  '977228', '4A',
  '977241', '3A',
  '977252', '7B',
  '977303', '4A',
  '977315', '4A',
  '977422', '4A'
)

We group by CHROM, so we can identify duplicates and number them with the original CLONEID and a sequential, starting from 1. After the renaming, we reorder so the new CLONEID's appear together.

df <-
  df %>%
  group_by(CHROM) %>%
  mutate(CLONEID = ifelse(row_number() == 1, CLONEID, paste0(first(CLONEID), "_", row_number()-1))) %>%
  ungroup() %>%
  arrange(CLONEID)

print(df)

Output

   CLONEID  CHROM
   <chr>    <chr>
 1 976803   2A   
 2 976877   5B   
 3 976952   6B   
 4 976961   3B   
 5 976975   1A   
 6 977084   7B   
 7 977084_1 7B   
 8 977228   4A   
 9 977228_1 4A   
10 977228_2 4A   
11 977228_3 4A   
12 977241   3A

lookup for values in two columns in the presence of duplicates in R (edited)

1 Answers1