Parsing a zero-width regex with a regex

Question

We use zero-width regex strings to specify the places in a string of amino acid symbols (basically A-Z) that are valid cleavage sites. For example, the proteolytic enzyme trypsin cleaves after K or R except when followed by P ((?<=[KR])(?!P)). I want to convert these regexes to the "cut/no-cut" notation also common in this field. For example, trypsin cuts after "KR" with a no-cut of "P". My first attempt at this works for simple cases:

// match zero or one regex term like (?<=[KR]) or (?<=K) or (?<![KR]) or (?<!K)
// followed by zero or one term like (?=[KR]) or (?=K) or (?![KR]) or (?!K)
boost::regex cutNoCutRegex("(?:\\(+\\?<([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?(?:\\(+\\?([=!])(\\[[A-Z]+\\]|[A-Z])\\)+)?");

Without the C++ escaping, that's:

(?:\(+\?<([=!])(\[[A-Z]+\]|[A-Z])\)+)?(?:\(+\?([=!])(\[[A-Z]+\]|[A-Z])\)+)?
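For readers who want to try the pattern, here is a minimal sketch of applying it to the text of a cleavage regex. It uses std::regex rather than Boost.Regex so that it compiles on its own; the pattern itself contains no look-around, so the ECMAScript grammar is enough. The names LookaroundParts and splitLookarounds are hypothetical and are not taken from the original code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical holder for the four captures; not part of the original code.
struct LookaroundParts {
    std::string behindSign;  // "=" or "!" from the look-behind, empty if absent
    std::string behindSet;   // e.g. "[KR]" or "K"
    std::string aheadSign;   // "=" or "!" from the look-ahead, empty if absent
    std::string aheadSet;    // e.g. "P"
};

// Splits a simple cleavage regex into its look-behind and look-ahead parts.
// Returns false when the input does not have the simple two-term shape.
bool splitLookarounds(const std::string& cleavageRegex, LookaroundParts& parts) {
    static const std::regex cutNoCutRegex(
        R"re((?:\(+\?<([=!])(\[[A-Z]+\]|[A-Z])\)+)?(?:\(+\?([=!])(\[[A-Z]+\]|[A-Z])\)+)?)re");
    std::smatch m;
    // Every part of the pattern is optional, so an empty input would match; reject it.
    if (cleavageRegex.empty() || !std::regex_match(cleavageRegex, m, cutNoCutRegex))
        return false;
    assert(m.size() == 5);  // whole match plus four capture groups
    parts.behindSign = m[1].str();
    parts.behindSet = m[2].str();
    parts.aheadSign = m[3].str();
    parts.aheadSet = m[4].str();
    return true;
}
```

With the trypsin regex (?<=[KR])(?!P) this should fill the parts with "=", "[KR]", "!" and "P". An input such as (?<=K|R) is rejected, which is the limitation the rest of the question is about.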

I'd like to change this to support somewhat more complicated regexes, such as multiple characters, non-capturing groups, character sets, ranges in character sets, negated sets, and start/end of string: (?<=K|R) or (?<=(?:K)|(?:R)) or (?<=[^A-JL-QS-Z]) or (?<=^M|[KR])

These extra features would seem to explode the complexity of the regex. I'm pretty sure I'll need to enable the "experimental" BOOST_REGEX_MATCH_EXTRA feature of Boost.Regex. Is there a better way to do what I'm doing? Am I missing some other regex possibilities in zero-width regexes?

Here is pseudo-code for my unit tests for the existing code covering many of the simple cases. The "sense" member is "C" when the "cut" field corresponds to the look-behind, and "N" when the "cut" field corresponds to the look-ahead. The current pepXMLSpecificity() function can invert the character set if it would produce a shorter list.

struct PepXMLSpecificity { std::string cut, no_cut, sense; };
void unit_assert_equal(string expected, string actual);

"(?<=[QWERTY])(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.no_cut);

"(?<![QWERTY])(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("QWERTY", result.no_cut);

"(?<=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?=[QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("QWERTY", result.cut);
unit_assert_equal("", result.no_cut);

"(?<![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("C", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

"(?![QWERTY])"
result = pepXMLSpecificity(ez);
unit_assert_equal("N", result.sense);
unit_assert_equal("ABCDFGHIJKLMNOPSUVZ", result.cut);
unit_assert_equal("", result.no_cut);

// the following tests aren't supported yet

"(?<=^M)|(?<=[KR])"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut); // the 'M' part is dropped
unit_assert_equal("", result.no_cut);

"(?<=K|R)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=(?:K)|(?:R))"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("", result.no_cut);

"(?<=[^A-JL-QS-Z])(?!P)"
unit_assert_equal("C", result.sense);
unit_assert_equal("KR", result.cut);
unit_assert_equal("P", result.no_cut);
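The supported tests above suggest that a negated look-behind set is reported as its complement over the residue alphabet. The expected strings omit X, so that alphabet is apparently not the full A-Z. Here is a minimal sketch of that step, assuming the alphabet is passed in explicitly. The names complementOf and cutResidues are hypothetical, and this is not the actual body of pepXMLSpecificity().

```cpp
#include <cassert>
#include <string>

// Returns the letters of `alphabet` that do not occur in `residues`,
// in the order in which they appear in `alphabet`.
std::string complementOf(const std::string& residues, const std::string& alphabet) {
    std::string result;
    for (char c : alphabet)
        if (residues.find(c) == std::string::npos)
            result += c;
    return result;
}

// Maps a look-around sign and its letter set to the residues for the "cut" field:
// '=' keeps the set as written, '!' replaces it with its complement.
std::string cutResidues(char sign, const std::string& residues, const std::string& alphabet) {
    assert(sign == '=' || sign == '!');
    return sign == '=' ? residues : complementOf(residues, alphabet);
}
```

Called with the set QWERTY and the alphabet ABCDEFGHIJKLMNOPQRSTUVWYZ (A-Z without X, an assumption inferred from the expected values), complementOf gives ABCDFGHIJKLMNOPSUVZ, the string used in the tests.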

It might be easier and faster to implement a parser than to get the regular expressions to work. — zneak, Jan 17 '12 at 17:58
@zneak: I thought about that, and it would probably be easier to read the code, but I don't think it would be any less complex. — Matt Chambers, Jan 17 '12 at 18:08
I'm confused by your use of the word "regex". I'm guessing you're using the word "regex" to refer both to the input data as well as the regular expression. Or are you really trying to parse a regex with a regex? — Mike Clark, Jan 17 '12 at 18:19
Let me rephrase your question: You have a regexp `$R` that you expect to be equivalent with `(?<$X)$Y`. You want to get '$X' and '$Y' from '$R'. This is possible for regular `$X` or `$Y`, although hard. Were you happy with a purely syntactic solution, you could probably use some simple parser (Boost.Spirit?, Yacc?) that could pick the two parts. — jpalecek, Jan 17 '12 at 18:22
Yes Mike, the input is a zero-width regex (see all the examples I gave). The parser is NOT a zero-width regex but a regular capturing regex. — Matt Chambers, Jan 17 '12 at 18:25
JPalecek, what do you mean "purely syntactic"? $X is a look-behind expression and $Y is a look-ahead expression, and only one or the other must be present (or both). — Matt Chambers, Jan 17 '12 at 18:29
@MattChambers: "Purely syntactic" means you expect your input regexes to be syntactically of the same form (above), as opposed to semantic equivalence (e.g. `(?<=X(?!.Z)(?!Y))` is not of your form, but is equivalent to `(?<=X)(?!YZ)`). If you're OK with NOT handling that input, you want a syntactic solution. — jpalecek, Jan 17 '12 at 18:35
If you really need to be able to handle *any* valid look-behind, then this is not possible in a true regular expression, because you'd need to handle recursive nesting. For example, `(?<=(?!L)[A-Z])` is a valid way of writing `(?<=[A-KM-Z])`. You can use something like http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/boost_xpressive/user_s_guide/grammars_and_nested_matches.html, but I'm really betting it'll be much more trouble than it's worth. I think you should take a step back and consider if you *really* need to take this approach. — ruakh, Jan 17 '12 at 18:36
@ruakh: No, I don't think I need to handle recursive zero-width expressions. I didn't even know that was valid. I do want to handle semantically equivalent forms, within reason. Your examples are using the nested zero-width expressions which I think are outside reasonable expectations for my input. — Matt Chambers, Jan 17 '12 at 20:19
@MattChambers: You should clarify your question then. Either you want to parse a syntactically given regular expression (easy *even with nested expressions and whatever*) or you want to reason about its meaning (hard to impossible). Or post a list of your expressions (if you just want to transform your data to another form) so we can see what you want. After 2 hours of commenting, we know **nothing** about your real question, just that you want something "within reason". So I'm voting to close as vague and fuzzy. — jpalecek, Jan 17 '12 at 20:57
@jpalecek: I don't understand how the examples I gave are vague. I want to parse `(?<=K|R)` or `(?<=(?:K)|(?:R))` or `(?<=[^A-JL-QS-Z])` or `(?<=^M|[KR])` into cut/no-cut semantics. Each of these cases except the last is equivalent to the first example (`(?<=[KR])`). The last adds the possibility to cleave after the first character of the string which is not representable with cut/no-cut (I would just drop the ^M semantics during the conversion). It's not an exhaustive list of expressions, but it pretty much defines what I consider "reasonable" to encounter in a zero-width cleavage regex. — Matt Chambers, Jan 17 '12 at 22:18
@MattChambers: You still don't get it. For parsing, it doesn't matter *what* you can encounter in a zero-width cleavage regex, because that doesn't make the problem any harder. However, it matters *where* you find the cut, resp. no-cut portion. If you know the former is always at the beginning and the no-cut is always at the end, it would be easy and making a parser that could split it wouldn't take more than an hour of work; that would be a purely syntactical approach. — jpalecek, Jan 17 '12 at 22:56
@MattChambers: The examples helped a lot, particularly the fact that the cut/no-cut always matches a single character. — jpalecek, Jan 17 '12 at 23:00
There's no way to translate that with full accuracy. I've pondered splitting such expressions into multiple specificities. If I keep it as a single specificity, there's no way to translate it accurately. It could be cut="AKR" no_cut="B" or cut="AKR" no_cut="". If I did multiple specificities, it could be one like cut="A" no_cut="B" and another like cut="KR" no_cut="". — Matt Chambers, Jan 17 '12 at 23:27
A more troublesome example is multiple residues. See this page for a table of enzymes in a more verbose form of cut/no_cut: http://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html Enterokinase would be like `(?<=[DN]{3}K)`. Luckily nobody uses enterokinase. :) — Matt Chambers, Jan 17 '12 at 23:33

score 1 · Accepted Answer · answered Jan 19 '12 at 15:17

As I understand it, you have a library of existing regexes which, when applied to common string representations of amino-acid sequences, identify probable cut-points for a proteolytic enzyme.

You want to automatically produce a standard textual description of the cutpoints implied by the regex.

Observations:

You don't need to be able to parse arbitrary regexes, you only need to be able to parse the cases that you actually have in your library.
You don't necessarily need to parse all of them. Particularly difficult ones could be kicked out and done by hand provided there weren't too many of them.

Really I think you need to do the following.

pepXMLSpecificity needs to return one or more descriptions, i.e. a vector<struct PepXMLSpecificity>, since a regex can be authored to combine arbitrary regexes (cf. jpalecek's comment).
You should tackle the actual contents of your library of regexes starting with the common cases and working down, and just add special cases for each common type of regex until you have got them all (or at least enough of them to satisfy your boss).
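One way to act on this advice, and on the parser suggested in the comments, is a small hand-written parser that handles only the common forms and reports failure for everything else. The sketch below covers the unsupported look-behind examples from the question: alternation, (?:...) groups, ranges, negated sets, and a ^-anchored alternative, which is dropped. All names are hypothetical. It complements over the full A-Z, which is an assumption, and it parses a single look-behind rather than returning a vector of specificities.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

typedef std::bitset<26> ResidueSet;  // bit i set means the letter 'A' + i

// Parses a bracket expression such as [KR] or [^A-JL-QS-Z]; pos must point at '['.
bool parseClass(const std::string& s, std::size_t& pos, ResidueSet& out) {
    assert(pos < s.size() && s[pos] == '[');
    ++pos;
    bool negated = pos < s.size() && s[pos] == '^';
    if (negated) ++pos;
    ResidueSet set;
    while (pos < s.size() && s[pos] != ']') {
        char lo = s[pos++];
        char hi = lo;
        if (pos + 1 < s.size() && s[pos] == '-' && s[pos + 1] != ']') {
            hi = s[pos + 1];  // range such as A-J
            pos += 2;
        }
        if (lo < 'A' || hi > 'Z' || lo > hi) return false;
        for (char c = lo; c <= hi; ++c) set.set(c - 'A');
    }
    if (pos >= s.size()) return false;  // no closing ']'
    ++pos;
    out = negated ? ~set : set;
    return true;
}

// Parses alternatives separated by '|' and stores the union of their letters in out.
// Stops at the first character that does not continue the alternation.
bool parseAlternation(const std::string& s, std::size_t& pos, ResidueSet& out) {
    out.reset();
    for (;;) {
        bool anchored = pos < s.size() && s[pos] == '^';
        if (anchored) ++pos;
        ResidueSet term;
        if (pos < s.size() && s[pos] == '[') {
            if (!parseClass(s, pos, term)) return false;
        } else if (s.compare(pos, 3, "(?:") == 0) {
            pos += 3;
            if (!parseAlternation(s, pos, term)) return false;
            if (pos >= s.size() || s[pos] != ')') return false;
            ++pos;
        } else if (pos < s.size() && s[pos] >= 'A' && s[pos] <= 'Z') {
            term.set(s[pos++] - 'A');
        } else {
            return false;  // quantifiers, nested look-arounds, etc. are not handled
        }
        if (!anchored) out |= term;  // start-of-string alternatives such as ^M are dropped
        if (pos >= s.size() || s[pos] != '|') return true;
        ++pos;
    }
}

// Converts one look-behind, e.g. "(?<=K|R)", into its residue letters.
// Returns false for anything this sketch does not understand.
bool lookBehindResidues(const std::string& text, std::string& residues) {
    bool positive = text.compare(0, 4, "(?<=") == 0;
    bool negative = text.compare(0, 4, "(?<!") == 0;
    if (!positive && !negative) return false;
    std::size_t pos = 4;
    ResidueSet set;
    if (!parseAlternation(text, pos, set)) return false;
    if (pos + 1 != text.size() || text[pos] != ')') return false;
    if (negative) set.flip();
    residues.clear();
    for (int i = 0; i < 26; ++i)
        if (set.test(i)) residues += static_cast<char>('A' + i);
    return true;
}
```

With this sketch, (?<=K|R), (?<=(?:K)|(?:R)), (?<=[^A-JL-QS-Z]) and (?<=^M|[KR]) should all give "KR". A multi-residue form such as (?<=[DN]{3}K) makes the function return false, so it can be handled by hand or passed through in the native regex format.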

Unfortunately there's not really a library outside the regexes that I already support. Users could enter any arbitrary zero-width regex though. I am beginning to think that it's ok to fail on some of the more complicated cases though - our alternative output format (not pepXML) supports the regexes natively so it's reasonable to make people use that for complex regexes. Thanks. — Matt Chambers, Jan 19 '12 at 18:02
