My string is "Escherichia coli str Nissle 1917", and I want to extract from a data frame all the rows that contain a similar string in a specific column (organism_name). The result should be the following:

  # assembly_accession  bioproject    biosample     wgs_master refseq_category
1:      GCF_000333215.1 PRJNA224116 SAMEA2272139 CAPM00000000.1              na
2:      GCF_000714595.1 PRJNA224116 SAMN02794012                             na
3:      GCF_003546975.1 PRJNA224116 SAMN07451663                             na
4:      GCF_019967895.1 PRJNA224116 SAMN18749717                             na
    taxid species_taxid                organism_name infraspecific_name isolate
1: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
2: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
3: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917
4: 316435           562 Escherichia coli Nissle 1917 strain=Nissle 1917

I tried agrep, but it doesn't work because of the word "str".

Is there a way to do a fuzzy match, or something similar, to extract these rows from my data frame given my input string?

Thanks a lot

Ste40

2 Answers

0

This should work:

df[df$organism_name == "Escherichia coli Nissle 1917",]
pbraeutigm
  • Thanks for your answer, unfortunately my string is "Escherichia coli str Nissle 1917", your solution would work for a perfect match. – Ste40 Jan 03 '22 at 12:34
  • You could then use a combination of str_starts(x, "Escherichia coli") and str_ends(x, "Nissle 1917") from library(stringr) – pbraeutigm Jan 03 '22 at 15:19
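
A minimal sketch of the prefix/suffix idea from the comment above. It uses base R's startsWith() and endsWith() in place of stringr's str_starts() and str_ends(), so that it runs without extra packages. The toy data frame and its second and third organism names are assumptions made up for illustration, not data from the question.

```r
# Toy data frame (hypothetical rows; only the first name is from the question).
df <- data.frame(
  organism_name = c(
    "Escherichia coli Nissle 1917",
    "Escherichia coli K-12",
    "Salmonella enterica Nissle 1917"
  ),
  stringsAsFactors = FALSE
)

# Keep rows whose name starts with the species and ends with the strain.
# With stringr this would be str_starts(x, ...) & str_ends(x, ...).
keep <- startsWith(df$organism_name, "Escherichia coli") &
  endsWith(df$organism_name, "Nissle 1917")

result <- df[keep, , drop = FALSE]
print(result)
```

Note that this ignores whatever sits between the prefix and the suffix, so it only helps when the differing word (here "str") is in the middle of the string.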
0
df[grepl("Escherichia coli.+Nissle 1917", df$organism_name), ]

The .+ matches one or more characters of any kind, so any text may separate Escherichia coli and Nissle 1917.
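
If you want to start from the input string itself rather than type the pattern by hand, here is a rough sketch under some assumptions: the only unwanted token is a standalone "str" (optionally "str."), and the remaining words contain no regex metacharacters. The toy data frame, the placeholder accessions and the names query, words and pattern are made up for illustration.

```r
# Toy data frame; only the first organism name comes from the question,
# the other rows and the GCF_EXAMPLE_* accessions are placeholders.
df <- data.frame(
  assembly_accession = c("GCF_000333215.1", "GCF_EXAMPLE_1", "GCF_EXAMPLE_2"),
  organism_name = c(
    "Escherichia coli Nissle 1917",
    "Escherichia coli str. K-12 substr. MG1655",
    "Salmonella enterica"
  ),
  stringsAsFactors = FALSE
)

query <- "Escherichia coli str Nissle 1917"

# Split the query into words and drop the standalone token "str" / "str.".
words <- strsplit(query, "\\s+")[[1]]
words <- words[!grepl("^str\\.?$", words)]

# Join the remaining words with ".+" so any text may sit between them.
pattern <- paste(words, collapse = ".+")

hits <- df[grepl(pattern, df$organism_name), ]
print(hits)
```

This is a plain regex match, not a true fuzzy match: every remaining word must appear, in the same order, in organism_name.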

nya