fuzzy search / approximate string matching with standard unix tools

Question

I'm working with Prokka annotation files, which give me the protein product of each gene as found in the UniProt database. Unfortunately, many genes are linked to multiple, very similar product names, e.g.

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

whereas these variants are actually different products

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E

To avoid trouble when mapping my genes to their respective products, I decided to replace all ambiguous or problematic characters such as "-", " " and "/" with "@" and to convert all strings to lower case.
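That normalisation step could look roughly like the sketch below. The function name `normalize` is made up for illustration, and the sample names are the ones listed above.

```shell
# Hypothetical helper: collapse spelling variants into one canonical key.
normalize() {
  # lower-case everything, then turn ' ', '/' and '-' into '@'
  tr '[:upper:]' '[:lower:]' | sed 's|[ /-]|@|g'
}

# Three spellings of the same product end up as a single line after sort -u.
printf '%s\n' \
  '1%2C2-phenylacetyl-CoA epoxidase%2C subunit A' \
  '1%2C2 phenylacetyl-CoA epoxidase%2C subunit A' \
  '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' \
  | normalize | sort -u
```

This only helps when both sides of the mapping are normalised the same way; it does not find variants that differ by an inserted or missing character.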

But is there a way to search for, e.g.,

1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

and also find closely related entries, using standard Unix tools such as grep? I could not find an answer so far.

Do you mean to say `1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A` and `1%2C2-phenylacetyl CoA epoxidase%2C subunit A` are different? — Inian, Apr 25 '18 at 14:25
Is your requirement to match _exactly_ the string `1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A` in the file? — Inian, Apr 25 '18 at 14:26
The first set of 4 names represents the same product, but the names were deposited in the database with slight differences. The latter 4 names are actually different products. I would like to use e.g. one of the first 4 strings to also find the other 3 versions. Hope this helps — crazysantaclaus, Apr 25 '18 at 14:30

score 1 · Accepted Answer · answered Apr 25 '18 at 15:01

If you want true fuzzy search, defined by string distance metrics, check out tre-agrep. For your application, I would use grep with case-insensitive matching and the period (`.`) wildcard in place of the characters that vary:

grep -i "1.2C2.phenylacetyl.CoA.epoxidase.2C subunit A" drugNames.txt

Each period matches any single character in that position, and `-i` makes the match case-insensitive, which is what you want.
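The grep approach can be tried on the names from the question. The sketch below goes one step beyond the answer: it derives the pattern from the query by replacing every non-alphanumeric character with a period, so the positions of the ambiguous characters do not have to be known in advance. The helper name `to_pattern`, the temporary file standing in for drugNames.txt, and the use of `-x` are assumptions added for illustration.

```shell
# Sample data standing in for drugNames.txt (names taken from the question).
names=$(mktemp)
cat > "$names" <<'EOF'
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
EOF

# Hypothetical helper: turn every non-alphanumeric character into '.'
to_pattern() {
  printf '%s\n' "$1" | sed 's/[^[:alnum:]]/./g'
}

query='1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A'
pattern=$(to_pattern "$query")

# -i ignores case; -x (an addition here) requires the whole line to match
matches=$(grep -ix -- "$pattern" "$names")
printf '%s\n' "$matches"

rm -f "$names"
```

Because a period stands for exactly one character, this finds variants that swap one character for another or change case, but not variants with an extra or missing character. Those need a distance-based tool such as tre-agrep.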

hhoke1

Thanks, I guess I will have a look at tre-agrep, because for different queries I would always need to know where the possibly ambiguous characters are – crazysantaclaus Apr 26 '18 at 14:15
