Extract rows from a data frame in R based on fuzzy match string

Question

My string is "Escherichia coli str Nissle 1917" and I want to extract from a data frame all the rows containing a similar string in a specific column (organism_name). The result should be the following:

  # assembly_accession  bioproject    biosample     wgs_master refseq_category
1:      GCF_000333215.1 PRJNA224116 SAMEA2272139 CAPM00000000.1              na
2:      GCF_000714595.1 PRJNA224116 SAMN02794012                             na
3:      GCF_003546975.1 PRJNA224116 SAMN07451663                             na
4:      GCF_019967895.1 PRJNA224116 SAMN18749717                             na
    taxid species_taxid                organism_name infraspecific_name isolate
1: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
2: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
3: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
4: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917

I tried agrep, but it doesn't work because of the word "str" in my string.

Is there a way to do a fuzzy match, or something similar, to extract these rows from my data frame given my input string?

Thanks a lot

score 0 · Answer 1 · answered Jan 03 '22 at 11:45

This should work:

df[df$organism_name == "Escherichia coli Nissle 1917",]
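
To illustrate why the exact comparison depends on knowing the stored name in advance, here is a minimal sketch on a toy data frame. The `id` column, the example names and the `max_dist` threshold are assumptions made for illustration, not part of the original data. It compares `==` using the asker's input string with an edit-distance filter based on `utils::adist()`:

```r
# Toy data; the id column and the names are illustrative only.
df <- data.frame(
  id = 1:3,
  organism_name = c("Escherichia coli Nissle 1917",
                    "Escherichia coli Nissle 1917",
                    "Escherichia coli str. K-12 substr. MG1655"),
  stringsAsFactors = FALSE
)
query <- "Escherichia coli str Nissle 1917"

# Exact comparison with the input string finds nothing,
# because the stored names lack the extra "str".
exact_hits <- df[df$organism_name == query, ]

# Generalized Levenshtein distance between the query and each name.
d <- as.vector(utils::adist(query, df$organism_name))

# max_dist is an assumed threshold; it would need tuning on real data.
max_dist <- 5
fuzzy_hits <- df[d <= max_dist, ]
```

Here the two Nissle rows are at distance 4 from the query (the four characters of "str " are deleted), so they pass the assumed threshold while the K-12 row does not.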

pbraeutigm


Thanks for your answer. Unfortunately my string is "Escherichia coli str Nissle 1917"; your solution only works for an exact match. – Ste40 Jan 03 '22 at 12:34

You could then combine str_starts(x, "Escherichia coli") and str_ends(x, "Nissle 1917") from library(stringr). – pbraeutigm Jan 03 '22 at 15:19
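
A rough sketch of that prefix-plus-suffix idea, assuming a toy data frame with a hypothetical `id` column. Base R's `startsWith()` and `endsWith()` are swapped in for the stringr functions here so that no extra package is needed; they test fixed strings rather than regular expressions:

```r
# Toy data; the id column and the names are illustrative only.
df <- data.frame(
  id = 1:3,
  organism_name = c("Escherichia coli Nissle 1917",
                    "Escherichia coli str Nissle 1917",
                    "Escherichia coli O157:H7"),
  stringsAsFactors = FALSE
)

# Keep rows whose name begins with the species and ends with the strain,
# whatever sits in between.
keep <- startsWith(df$organism_name, "Escherichia coli") &
  endsWith(df$organism_name, "Nissle 1917")
hits <- df[keep, ]
```

With stringr loaded, `str_starts()` and `str_ends()` could replace the two base functions, but they interpret the pattern as a regular expression by default.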

score 0 · Answer 2 · answered Jan 25 '22 at 14:25

df[grepl("Escherichia coli.+Nissle 1917", df$organism_name), ]

The .+ matches one or more characters of any kind between "Escherichia coli" and "Nissle 1917", so names are matched whether or not extra words sit between the two parts.
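
A small sketch of how this pattern behaves, using a toy data frame whose `id` column and names are assumptions for illustration:

```r
# Toy data; the id column and the names are illustrative only.
df <- data.frame(
  id = 1:4,
  organism_name = c("Escherichia coli Nissle 1917",
                    "Escherichia coli str Nissle 1917",
                    "Escherichia coli str. K-12 substr. MG1655",
                    "Salmonella enterica"),
  stringsAsFactors = FALSE
)

# ".+" needs at least one character between the two parts;
# in the first name that character is the single space.
pattern <- "Escherichia coli.+Nissle 1917"
hits <- df[grepl(pattern, df$organism_name), ]
```

Only the two Nissle names are kept. Note that the pattern is not anchored, so a name with extra text before "Escherichia" or after "1917" would match as well.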

nya

