3

I am currently attempting to implement a regular expression engine. (Yes, for fun. Go fig.)

I am working from this site for the general algorithmic approach: http://swtch.com/~rsc/regexp/regexp1.html

My question for you all is: do you know of a collection of regular expressions and text strings that I can use as a comprehensive testbed for my engine? I've been searching and asking around for a couple of days now and can't find anything specific; maybe my Google keyword-fu is lacking.

Thanks!

p.s. By way of example:

regexes:

  • "a"
  • "abc"
  • "^a$"
  • "[a-c]"
  • "^[^a]$"
  • "^[^a]?$"
  • "a+"
  • "."
  • ".*"
  • ".+"
  • "da?[bd]"

strings:

  • ""
  • "a"
  • "h"
  • "dd"
  • "abc"
  • "dad"
  • "dabcd"
  • "aaaaab"
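
One way to turn the two lists above into a testbed is to cross every regex with every string and let an existing engine supply the expected answers. The sketch below uses Python's `re` module as that reference; `expected_table`, `check_engine` and the `search` callback are hypothetical names made up for illustration, and `search(pattern, text)` is assumed to return the matched text or `None`.

```python
import re

# Patterns and subject strings copied from the two lists above.
PATTERNS = ["a", "abc", "^a$", "[a-c]", "^[^a]$", "^[^a]?$",
            "a+", ".", ".*", ".+", "da?[bd]"]
STRINGS = ["", "a", "h", "dd", "abc", "dad", "dabcd", "aaaaab"]


def expected_table(patterns=PATTERNS, strings=STRINGS):
    """Map (pattern, string) to the text Python's re.search matches, or None."""
    table = {}
    for pat in patterns:
        compiled = re.compile(pat)
        for s in strings:
            m = compiled.search(s)
            table[(pat, s)] = m.group(0) if m else None
    return table


def check_engine(search, table=None):
    """Return every case where `search` disagrees with the reference table.

    `search` is a stand-in for the engine under test (an assumption of this
    sketch): it takes (pattern, text) and returns the matched text or None.
    """
    table = expected_table() if table is None else table
    failures = []
    for (pat, s), want in table.items():
        got = search(pat, s)
        if got != want:
            failures.append((pat, s, want, got))
    return failures
```

If the engine only answers yes or no, comparing against `want is not None` should be enough. An engine that reports the longest match rather than the first one found by backtracking may legitimately disagree with `re` on patterns that use alternation, so those cases would need checking by hand.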
ibiwan
  • +1 for the linked article, it's really interesting. Sorry that I can't help on the actual question, but may I ask if you know whether the performance problems in Perl/Python/Ruby etc. are still present? – Niklas B. Feb 08 '12 at 22:48
  • Yes, they are still present -- have to be, as long as they support back-references. Luckily the problem is just in pathological cases, but then it's horrible! – ibiwan Feb 08 '12 at 22:52
  • I've also seen very simple regexes like `^[asd3kgnvo]*$` perform very poorly compared to other approaches (especially in Java and Python). This is astonishing, as they have a lot of potential for optimization, I imagine. **EDIT:** Just found the performance test from a recent question: http://ideone.com/oPKYq Made me really sad :( – Niklas B. Feb 08 '12 at 22:56

1 Answer

2

Long ago I wrote a simple filename pattern matching function (file patterns are a special subset of regular expressions). In the code (in C) I provided a few dozen test cases. You could probably adapt them for use with a regular expression matcher.

Source is at:
http://david.tribble.com/src/fpattern.c
http://david.tribble.com/src/fpattern.h
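
As a rough illustration of what adapting filename-pattern cases might look like, here is a minimal Python sketch. It handles only `*`, `?` and literal characters, the sample cases are invented for illustration and are not taken from `fpattern.c`, and `glob_to_regex` and `regex_cases` are hypothetical helper names. The expected results come from the standard `fnmatch.fnmatchcase`.

```python
import fnmatch
import re


def glob_to_regex(pattern):
    """Translate a glob made of `*`, `?` and literals into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal, escaped if it is special
    return "^" + "".join(parts) + "$"


# Illustrative (glob, filename) pairs only; not copied from fpattern.c.
GLOB_CASES = [
    ("*.c", "fpattern.c"),
    ("*.c", "fpattern.h"),
    ("f?attern.*", "fpattern.h"),
    ("abc", "abcd"),
    ("a*d", "abcd"),
]


def regex_cases(cases=GLOB_CASES):
    """Return (regex, string, should_match) triples, using fnmatch as the oracle."""
    return [(glob_to_regex(g), s, fnmatch.fnmatchcase(s, g)) for g, s in cases]
```

The translated patterns rely on `\.` to match a literal dot, so this only helps once the engine under test supports backslash escapes.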

David R Tribble