I'm working with Prokka annotation files, which give me the protein product of each gene as found in the UniProt database. Unfortunately, many genes are linked to multiple, very similar product names, e.g.
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
whereas the following entries are genuinely different products:
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E
To avoid trouble when mapping my genes to their respective products, I decided to replace all ambiguous or problematic characters such as "-", " " and "/" with "@" and to convert all strings to lower case.
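A minimal sketch of such a normalisation, assuming the product names arrive one per line on standard input. The function name `normalize` is made up for illustration, and this is not necessarily the exact command I ran. Note that lowercasing also turns the encoded comma `%2C` into `%2c`.

```shell
# Hypothetical helper: lowercase each line, then replace "-", "/" and " " with "@".
normalize() {
    tr '[:upper:]' '[:lower:]' | sed 's#[-/ ]#@#g'
}

# The four spelling variants from above collapse to a single normalized key.
printf '%s\n' \
    '1%2C2-phenylacetyl-CoA epoxidase%2C subunit A' \
    '1%2C2 phenylacetyl-CoA epoxidase%2C subunit A' \
    '1%2C2-phenylacetyl CoA epoxidase%2C subunit A' \
    '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' |
    normalize | sort -u
# prints: 1%2c2@phenylacetyl@coa@epoxidase%2c@subunit@a
```

The subunit letter at the end is left untouched, so subunit A and subunit B still map to different keys.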
But is there a way to search for, e.g.,
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
and also match its closely related variants, using standard Unix tools such as grep? I could not find an answer so far.