
I have strings in file1 that match part of the strings in file2. I want to filter out the strings in file2 that partly match those in file1. Please see my attempt below; I am not sure how to define a substring match in this way.

file1:

V1
species1
species121
species14341

file2:

V1
genus1|species1|strain1
genus1|species121|strain1
genus1|species1442|strain1
genus1|species4242|strain1
genus1|species4131|strain1

my try:

file1[!file1$V1 %in% file2$V1]
user2300940
  • You could try to read in both files as dataframes and **fuzzy join** those (https://stackoverflow.com/search?q=%5Br%5D+fuzzy+join+dataframes). Choose the appropriate join direction to keep only matching rows. –  May 09 '22 at 10:54

2 Answers

2

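One way to get what you want is to extract the part of each file2 string that lies between the | characters, using rm_between from the qdapRegex package, and then compare it with file1$V1 using %in%. So, you can run the following code: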

# Load library
library(qdapRegex)
# Extract the names of file2$V1 you are interested in (those between | |)
v <- unlist(rm_between(file2$V1, "|", "|", extract = TRUE))
# Which of these elements are in file1$V1? (logical vector, one entry per name)
elem.are <- v %in% file1$V1
# Drop the matching elements; a logical index also works when nothing matches
file2$V1[!elem.are]

  1. In v we save the names of file2$V1 we are interested in (those between | |)

  2. Then we save in elem.are a logical vector marking the names that appear in file1$V1

  3. Finally, we omit those elements using file2$V1[!elem.are]
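
The same steps can also be sketched in base R, without qdapRegex. This is only an illustration under some assumptions: the two files have been read in as data frames with a character column V1 (the data below is copied from the question), and the species name is always the second |-separated field. The names parts, species and kept are made up for this example.

```r
# Example data copied from the question (assumed to be data frames with a column V1)
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# Split each string on a literal "|" (fixed = TRUE disables regex interpretation)
parts <- strsplit(file2$V1, "|", fixed = TRUE)

# Keep the second field of each string (assumed to be the species name)
species <- vapply(parts, function(p) p[2], character(1))

# Keep only the rows whose species field is NOT listed in file1
kept <- file2[!(species %in% file1$V1), , drop = FALSE]
kept$V1
```

With the example data this keeps the species1442, species4242 and species4131 rows, because the comparison is on the whole species field rather than on a substring. Changing p[2] to p[1] would compare the genus field instead.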

R18
  • Sorry, I meant opposite. Want to remove from file2, the elements found in file1. I guess just switching the filenames works – user2300940 May 09 '22 at 12:00
  • Although the change in your approach is small, the code changes a lot. I have edited my message, and now I think it answers your question. – R18 May 09 '22 at 12:37
  • Thanks. What if I am also interested in the names before | in file2? – user2300940 May 11 '22 at 10:06
  • Then, instead of `unlist(rm_between(file2$V1, "|", "|", extract = T))` you should write something like any of the answers in this post https://stackoverflow.com/questions/38291794/extract-string-before – R18 May 11 '22 at 10:16
1

You cannot use the %in% operator in this way in R. It tests whether each element of one vector is equal to some element of another vector. Unlike in in Python, it does not match substrings. Look at this:

"species1" %in% "genus1|species1|strain1" # FALSE
"species1" %in% c("genus1", "species1", "strain1") # TRUE

You can, however, use grepl for this (the l is for logical, i.e. it returns TRUE or FALSE).

grepl("species1", "genus1|species1|strain1") # TRUE

There's an additional complication here: grepl does not accept a vector of patterns. If you pass one, it only uses the first element:

grepl(file1$V1, "genus1|species1|strain1") 
[1] TRUE
Warning message:
In grepl(file1$V1, "genus1|species1|strain1") :
  argument 'pattern' has length > 1 and only the first element will be used

The above simply tells you that the first element of file1$V1 is in "genus1|species1|strain1".
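
To test every pattern rather than only the first, one option is to call grepl once per pattern. The following is a minimal sketch, assuming the file1 values are meant as literal text (hence fixed = TRUE); patterns, target and hits are hypothetical names used only here.

```r
# Patterns copied from file1 in the question
patterns <- c("species1", "species121", "species14341")
target <- "genus1|species1|strain1"

# Call grepl once per pattern; fixed = TRUE matches the pattern as literal text
hits <- vapply(patterns,
               function(p) grepl(p, target, fixed = TRUE),
               logical(1),
               USE.NAMES = FALSE)
hits
# [1]  TRUE FALSE FALSE
```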

Furthermore, you want to compare each element in file1$V1 to an entire vector of strings, rather than just one string. That's OK but you will get a vector the same length as the second vector as an output:

grepl("species1", file2$V1) 
[1]  TRUE  TRUE  TRUE FALSE FALSE

We can just see if any() of those are a match. As you've tagged your question with tidyverse, here's a dplyr solution:

library(dplyr)
file1 |>
    rowwise() |> # This makes sure you only pass one element at a time to `grepl`
    mutate(
        in_v2 = any(grepl(V1, file2$V1)) 
    ) |>
    filter(!in_v2)

# A tibble: 1 x 2
# Rowwise: 
#   V1           in_v2
#   <chr>        <lgl>
# 1 species14341 FALSE
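
For the opposite direction (dropping the rows of file2 that contain any file1 string), the same idea could be sketched in base R as follows. This assumes both files are data frames with a character column V1, uses data copied from the question, and treats the file1 values as literal text; contains_any, hit and filtered are hypothetical names for this example.

```r
# Example data copied from the question
file1 <- data.frame(V1 = c("species1", "species121", "species14341"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c("genus1|species1|strain1",
                           "genus1|species121|strain1",
                           "genus1|species1442|strain1",
                           "genus1|species4242|strain1",
                           "genus1|species4131|strain1"),
                    stringsAsFactors = FALSE)

# TRUE if any of the patterns occurs as a literal substring of s
contains_any <- function(s, patterns) {
  any(vapply(patterns, function(p) grepl(p, s, fixed = TRUE), logical(1)))
}

# One logical value per row of file2
hit <- vapply(file2$V1, contains_any, logical(1),
              patterns = file1$V1, USE.NAMES = FALSE)

# Keep only the rows that contain none of the file1 strings
filtered <- file2[!hit, , drop = FALSE]
filtered$V1
```

Note that this is a substring match, so the species1442 row is dropped as well, because "species1" occurs inside "species1442"; only the species4242 and species4131 rows remain. If whole-field matching is wanted, the |-separated fields would need to be compared exactly, for example with %in%.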
SamR
  • Sorry, I meant opposite. Want to remove from file2, the elements found in file1. I guess just switching the filenames works – user2300940 May 09 '22 at 12:00
  • Yes exactly - it's the same principle – SamR May 09 '22 at 12:30
  • @user2300940 You could also just do `file2[!sapply(file2$V1, \(x) any(grepl(x, file1$V1))),]`, provided that genus names are always separated by `|`, as this acts as the OR operator in regex. Incidentally feel free to accept either this answer or the one by @R18 - I think either is an acceptable approach. – SamR May 09 '22 at 12:50