We use zero-width regex strings to specify the places in a string of amino acid symbols (basically A-Z) that are valid cleavage sites. For example, the proteolytic enzyme trypsin cleaves after K or R except when followed by P ((?<=[KR])(?!P)
). I want to convert these regexes to the "cut/no-cut" notation also common in this field. For example, trypsin cuts after "KR" with a no-cut of "P". My first attempt at this works for simple cases:
// match zero or one regex term like (?<=[KR]) or (?<=K) or (?<![KR]) or (?<!K)
// followed by zero or one term like (?=[KR]) or (?=K) or (?![KR]) or (?!K)
boost::regex cutNoCutRegex("(?:\\(+\\?<([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?(?:\\(+\\?([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?");
Without the C++ escaping, that's:
(?:\(+\?<([=!])(\[[A-Z]+\]|[A-Z])\)+)?(?:\(+\?([=!])(\[[A-Z]+\]|[A-Z])\)+)?
I'd like to change this to support somewhat more complicated regexes, such as multiple characters, non-capturing groups, character sets, ranges in character sets, negated sets, and start/end of string:
(?<=K|R)
or (?<=(?:K)|(?:R))
or (?<=[^A-JL-QS-Z])
or (?<=^M|[KR])
These extra features would seem to explode the complexity of the regex. I'm pretty sure I'll need to enable the "experimental" BOOST_REGEX_MATCH_EXTRA feature of Boost.Regex. Is there a better way to do what I'm doing? Am I missing some other regex possibilities in zero-width regexes?
Here is pseudo-code for my unit tests for the existing code covering many of the simple cases. The "sense" member is "C" when the "cut" field corresponds to the look-behind, and "N" when the "cut" field corresponds to the look-ahead. The current pepXMLSpecificity() function can invert the character set if it would produce a shorter list.
struct PepXMLSpecificity { std::string cut, no_cut, sense; };
void unit_assert_equal(string expected, string actual);
"(?<=[QWERTY])(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.no_cut);
"(?<![QWERTY])(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("QWERTY", result.no_cut);
"(?<=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);
"(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);
"(?<![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);
"(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);
// the following tests aren't supported yet
"(?<=^M)|(?<=[KR])"
unit_assert_equal("N", result.sense);
unit_assert_equal("KR", result.cut); // the 'M' part is dropped
unit_assert_equal("", result.no_cut);
"(?<=K|R)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);
"(?<=(?:K)|(?:R))"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);
"(?<=[^A-JL-QS-Z])(?!P)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("P", result.no_cut);