I have two dataframes, dfa
and dfb
:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10)
)
which look like this:
> dfa
gene_name id
1 MUC16 1
2 MUC2 2
3 MET 3
4 FAT1 4
5 TERT 5
> dfb
gene_name id
1 MUC1 6
2 MET; BLEP 7
3 MUC21 8
4 FAT 9
5 TERT 10
dfa
is my genes of interest list: I want to keep the dfb
rows where they appear, minding the digits (MUC1
is not MUC16
). My new_df
should look like this:
> new_df
gene_name id
1 MET; BLEP 7
2 TERT 10
My problem is that the regular dplyr::semi_join()
does exact matches, which doesn't take into account the fact that dfb$gene_names
can contain genes separated with "; "
. Meaning that with this example, "MET"
is not retained.
I tried to look into fuzzyjoin::regex_semi_join
, but I can't make it do what I want...
A tidyverse solution would be welcome. (Maybe with stringr
?!)
EDIT: Follow-up question...
How would I go about to do the reciprocal anti_join
? Simply changing semi_join
to anti_join
in this method doesn't work because the row MET; BLEP
is present when it shouldn't be...
Adding a filter(gene_name == new_col)
after the anti_join
works with the provided simple dataset, but if I twist it a little like this:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
id = c(6:10)
)
...then it doesn't anymore. Here and in my real-life dataset, dfa
doesn't contain semicolons, it's only one column of individual gene names. But dfb
contains a lot of information, and multiple combinations of semicolons...