I want to write tests for a regular expression analysis engine. It would be nice if I could generate arbitrary pairs of equivalent regular expressions, to see whether the engine correctly parses them and identifies them as being equivalent. Is there any known algorithm for doing so?

I would also accept a list of 20-100 well-known regex equivalences, if anyone knows of a pre-created list. For example, a*a and aa*, or (ab)*a and a(ba)*.

ahelwer

1 Answer

The method I came up with is as follows. I assembled a list of simple regex transformations that preserve equivalence, for example (assuming a and b are equivalent):

  • f(a, b) ⩴ (a*a, bb*)
  • f(a, b) ⩴ (aa?, b?b)
  • f(a, b) ⩴ (ab, ba)
  • f(a, b) ⩴ (a[\d]+, b[0-9]+)

and so on. I then repeatedly applied randomly chosen transformations from this list to a pair of starting regexes known to be equal, for example (x, x). The end result is a pair of complicated but equivalent regexes. This generation algorithm is suitable for use in property-based testing.
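The generation loop described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the original implementation: the names `group`, `TRANSFORMS`, `equivalent_pair`, and `agree_on` are hypothetical, each subexpression is wrapped in a non-capturing group so that the postfix operators apply to the whole subexpression, and the `[\d]+` / `[0-9]+` rule is only an equivalence when `\d` is restricted to ASCII digits (in Python's `re`, that means the `re.ASCII` flag).

```python
import random
import re


def group(r):
    # Wrap in a non-capturing group so that *, ? and + bind to all of r.
    return "(?:" + r + ")"


# Each rule maps an equivalent pair (a, b) to a new equivalent pair.
# These mirror the four example rules listed above (hypothetical encoding).
TRANSFORMS = [
    lambda a, b: (group(a) + "*" + group(a), group(b) + group(b) + "*"),
    lambda a, b: (group(a) + group(a) + "?", group(b) + "?" + group(b)),
    lambda a, b: (group(a) + group(b), group(b) + group(a)),
    # Assumes ASCII-only digit semantics for \d.
    lambda a, b: (group(a) + "[\\d]+", group(b) + "[0-9]+"),
]


def equivalent_pair(steps, rng, seed=("x", "x")):
    # Start from a trivially equal pair and apply `steps` random rules.
    a, b = seed
    for _ in range(steps):
        a, b = rng.choice(TRANSFORMS)(a, b)
    return a, b


def agree_on(a, b, s):
    # Sanity check with Python's own engine: do both regexes
    # accept or reject the whole string s?
    return bool(re.fullmatch(a, s, re.ASCII)) == bool(re.fullmatch(b, s, re.ASCII))
```

Calling `equivalent_pair(5, random.Random())` yields a pair of regex strings to feed to the engine under test. Because the regexes roughly double in length with each step, a small step count is usually enough. The `agree_on` helper compares the two regexes on a single input string using a different engine, which gives a check that does not rely on the transformation rules themselves.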

ahelwer
  • I find testing regex engines a very interesting topic, because it tackles some of the fundamental questions in property-based testing, like effective test-case generation, oracle problem etc. I'd like to see (and discuss) the other property tests you have; is the code open-source? – johanneslink Nov 18 '21 at 07:09
  • Another comment: Does the implementation of the analysis engine also use the same set of transformations to decide on equivalence? If so, you might have fallen prey to the "tautology trap" of automated testing. – johanneslink Nov 18 '21 at 07:14