
I'm working with Prokka annotation files that give me the protein product of each gene as found in the UniProt database. Unfortunately, many genes are linked to multiple, very similar product names, e.g.

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

whereas these variants are actually different products

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E

To avoid trouble when mapping my genes to their respective products, I decided to replace all potentially ambiguous or problematic characters such as "-", " " and "/" with "@" and convert all strings to lower case.
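A minimal sketch of that normalization with standard tools, assuming one product name per call; `normalize` is a hypothetical helper name, not part of Prokka or any other tool:

```shell
# normalize NAME: lower-case the name, then replace "-", " " and "/" with "@".
# Inside the bracket expression the "-" comes first so it is taken literally.
normalize() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's|[- /]|@|g'
}

normalize '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A'
normalize '1%2C2 phenylacetyl-CoA epoxidase%2C subunit A'
```

Both calls should print the same string, while a name ending in "subunit B" stays distinct.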

But would there be a way to search e.g. for

1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

and also find closely related entries, using standard Unix tools such as grep? I could not find an answer so far.

crazysantaclaus
  • Do you mean to say `1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A` and `1%2C2-phenylacetyl CoA epoxidase%2C subunit A` are different? – Inian Apr 25 '18 at 14:25
  • Is your requirement to match _exactly_ the string `1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A` in the file? – Inian Apr 25 '18 at 14:26
  • The first set of 4 names represents the same product, but the names were deposited slightly differently in the database. The latter 4 names are actually different products. I would like to use e.g. one of the first 4 strings to also find the other 3 versions. Hope this helps – crazysantaclaus Apr 25 '18 at 14:30

1 Answer

If you want true fuzzy search, defined by string distance metrics, check out tre-agrep. For your application, I would use grep with case-insensitive matching and periods as single-character wildcards.

grep -i "1.2C2.phenylacetyl.CoA.epoxidase.2C subunit A" drugNames.txt

This matches any single character in place of each period and ignores case, which is what you want.
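If you do not want to place the periods by hand for every query, the pattern can be derived from the query itself. The following is a rough sketch, assuming that every character that is not a letter or digit is a possible ambiguity; `to_pattern` and `fuzzy_grep` are hypothetical helper names:

```shell
# to_pattern NAME: replace every character that is not a letter or digit
# with ".", which matches any single character in a regular expression.
to_pattern() {
    printf '%s\n' "$1" | sed 's/[^[:alnum:]]/./g'
}

# fuzzy_grep QUERY FILE: case-insensitive search for the derived pattern.
fuzzy_grep() {
    grep -i -- "$(to_pattern "$1")" "$2"
}

# Example call (drugNames.txt as in the command above):
# fuzzy_grep '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' drugNames.txt
```

Note that each period still stands for exactly one character, so a variant in which a separator is missing entirely or doubled would not be found; that case needs a distance-based tool such as tre-agrep.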

hhoke1
  • Thanks, I guess I will have a look into tre-agrep, because for different queries I would always need to know where possible ambiguous characters are – crazysantaclaus Apr 26 '18 at 14:15