How to semi_join two dataframes by string column with one being colon-separated

Question

I have two dataframes, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

which look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_names can contain genes separated with "; ". Meaning that with this example, "MET" is not retained.

I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...

A tidyverse solution would be welcome. (Maybe with stringr?!)

EDIT: Follow-up question...

How would I go about to do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work because the row MET; BLEP is present when it shouldn't be...

Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = c(6:10)
)

...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...

Sada93 · Answer 1 · 2019-11-14T19:09:48.190

3

You can use seperate_rows() to split the dataframe before joining. Note that if BLEP existed in dfa, it would result in a duplicate, which is why distinct is used

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5),
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10),
  stringsAsFactors = FALSE
)


library(tidyverse)

dfb%>%
  mutate(new_col = gene_name)%>%
  separate_rows(new_col,sep = "; ")%>%
  semi_join(dfa,by = c("new_col" = "gene_name"))%>%
  select(gene_name,id)%>%
  distinct()

edited Nov 14 '19 at 19:09

answered Nov 13 '19 at 15:10

Sada93

2,785
1
10
21

1

This could also do the job `... separate_rows(new_col,sep = "; ") %>% filter(new_col %in% dfa$gene_name) %>% ...` instead of joining. – AntoniosK Nov 13 '19 at 15:16
Thanks @Sada93 and @AntoniosK, your answers both work! @Sada93, if you modify `separate_rows(new_col,sep = ";")%>% mutate(new_col = gsub(" ","",new_col))%>%` to become only `separate_rows(new_col,sep = "; ")%>%` I'll be glad to accept your answer! – MonkeyBack Nov 13 '19 at 16:22
Follow-up question: how would I go about to do the reciprocal `anti_join`? Simply changing `semi_join` to `anti_join` in this method doesn't work because the row `MET; BLEP` is present when it shouldn't be... @AntoniosK – MonkeyBack Nov 14 '19 at 12:37
You can add a `filter(gene_name == new_col)` after the `anti_join` but I think this would only work if `dfa` is not represented by semicolon notation. If dfa can also contain semicolons we would need something fancier. – Sada93 Nov 14 '19 at 19:32
It works with the provided simple dataset, but if I twist it a little like this: `dfa <- data.frame( gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"), id = c(1:5) ) dfb <- data.frame( gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"), id = c(6:10) )` then it doesn't anymore. Here and in my real-life dataset, `dfa` doesn't contain semicolons, it's only one column of individual gene names. But `dfb` contains a lot of information, and multiple combinations of semicolons... – MonkeyBack Nov 18 '19 at 10:03
Should the previous comment be its own question/post? – MonkeyBack Nov 18 '19 at 12:22

score 0 · Answer 2 · answered Nov 13 '19 at 16:24

Here's a solution using stringr and purrr.

library(tidyverse)

dfb %>%
 mutate(gene_name_list = str_split(gene_name, "; ")) %>%
 mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
 filter(gene_of_interest == TRUE) %>%
 select(gene_name, id)

MonkeyBack · Accepted Answer · 2019-11-18T17:34:10.643

I think I finally managed to make fuzzyjoin::regex_joins do what I want. It was ridiculously simple, I just had to tweak my dfa filter list:

library(fuzzyjoin)

# add "\b" regex expression before/after each gene of the list to filtrate from
# (to search for whole words)
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")

# to keep genes from dfb that are present in the dfa filter list
dfb %>% 
  regex_semi_join(dfa, by = c(gene_name = "gene_name"))

# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>% 
  regex_anti_join(dfa, by = c(gene_name = "gene_name"))

One drawback though: it's quite slow...

How to semi_join two dataframes by string column with one being colon-separated

3 Answers3