16

Edit 2: For a practical demonstration of why this remains important, look no further than Stack Overflow's own regex-caused outage today (2016-07-20)!

Edit: This question has evolved considerably since I first asked it. See below for two fast and compatible, but not fully featured, implementations. If you know of more or better implementations, please mention them - there still isn't an ideal implementation here yet!

Where can I find a reliably fast regex implementation?

Does anyone know of a non-backtracking, linear-time regex implementation (System.Text.RegularExpressions backtracks), either for .NET or native but reasonably usable from .NET? To be useful, it would need to:

  • have a worst-case time complexity of regex evaluation of O(m*n), where m is the length of the regex and n the length of the input.
  • have a normal time complexity of O(n), since almost no regular expressions actually trigger the exponential state space, or, if they can, only do so on a minute subset of the input.
  • have reasonable construction speed (i.e. no potentially exponential DFAs).
  • be intended for use by human beings, not mathematicians - e.g. I don't want to reimplement Unicode character classes: .NET- or PCRE-style character classes are a plus.

Bonus Points:

  • bonus points for practicality if it implements stack-based features which let it handle nesting at the expense of consuming O(n+m) memory rather than O(m) memory.
  • bonus points for either capturing subexpressions or replacements (if there are an exponential number of possible subexpression matches, then enumerating all of them is inherently exponential - but enumerating the first few shouldn't be, and similarly for replacements). You can work around missing either feature by using the other, so having either one is sufficient.
  • lotsa bonus points for treating regexes as first-class values, so you can take the union, intersection, concatenation, and negation - in particular negation and intersection, as those are very hard to do by string manipulation of the regex definition (a hypothetical API sketch follows this list).
  • lazy matching, i.e. matching on unlimited streams without putting the whole stream in memory, is a plus. If the streams don't support seeking, capturing subexpressions and/or replacements aren't (in general) possible in a single pass.
  • Backreferences are out, they are fundamentally unreliable; i.e. can always exhibit exponential behavior given pathological input cases.
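
To make the first-class-values wish concrete, here's a purely hypothetical API sketch - the IRegexValue name and every member on it are my invention, and nothing like this ships with .NET; it's only meant to show the shape of what I'm after:

// Hypothetical sketch only: treating regexes as composable values.
// With automata underneath, every operation below is well-defined:
// union and intersection via the product construction, complement by
// completing a DFA and swapping its accepting states.
interface IRegexValue
{
    IRegexValue Union(IRegexValue other);      // L(this) ∪ L(other)
    IRegexValue Intersect(IRegexValue other);  // L(this) ∩ L(other)
    IRegexValue Complement();                  // Σ* minus L(this)
    IRegexValue Concat(IRegexValue other);     // L(this) · L(other)
    bool IsMatch(string input);
}

With something like that, "a dictionary minus some patterns" becomes dictionary.Intersect(patterns.Complement()) instead of string surgery on the pattern text.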

Such algorithms exist (this is basic automata theory...) - but are there any practically usable implementations accessible from .NET?

Background: (you can skip this)

I like using regexes for quick and dirty text clean-ups, but I've repeatedly run into issues where the common backtracking NFA implementation used by Perl/Java/Python/.NET shows exponential behavior. These cases are unfortunately rather easy to trigger as soon as you start generating your regular expressions automatically. Even non-exponential performance can become exceedingly poor when you alternate between regexes that match the same prefix - for instance, in a really basic example, if you take a dictionary and turn it into a regular expression, expect terrible performance.
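
To make the failure mode concrete, here is a minimal repro against System.Text.RegularExpressions. The pattern/input pair is the classic pathological one; the match-timeout overload assumes .NET 4.5 or later (without it, the call just appears to hang):

using System;
using System.Text.RegularExpressions;

class BacktrackingDemo
{
    static void Main()
    {
        // Nested quantifiers plus an input that *almost* matches force
        // the backtracking engine to explore an exponential number of
        // ways to split the 'a's between the inner and outer loops.
        var regex = new Regex("^(a+)+$", RegexOptions.None,
                              TimeSpan.FromSeconds(5)); // .NET 4.5+ overload

        string input = new string('a', 40) + "b"; // ~2^40 paths to reject

        try
        {
            Console.WriteLine(regex.IsMatch(input));
        }
        catch (RegexMatchTimeoutException)
        {
            // A non-backtracking engine answers "no match" in ~40 steps.
            Console.WriteLine("Timed out!");
        }
    }
}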

For a quick overview of why better implementations exist - and have existed since the 1960s - see Regular Expression Matching Can Be Simple And Fast.

Not quite practical options:

  • Almost ideal: the FSA toolkit. It can compile regexes to fast C implementations of DFAs and NFAs, allows transducers(!) too, and has first-class regexes (encapsulation, yay!) including syntax for intersection and parametrization. But it's in Prolog... (why is something with this kind of practical feature set not available in a mainstream language???)
  • Fast but impractical: a full parser generator, such as the excellent ANTLR, generally supports reliably fast regexes. However, ANTLR's syntax is far more verbose, and of course it permits constructs that may not generate valid parsers, so you'd need to find some safe subset.

Good implementations:

  • RE2 - a Google open-source library aiming for reasonable PCRE compatibility, minus backreferences. I think this is the successor to the Unix port of Plan 9's regex library, given the author.
  • TRE - also mostly PCRE-compatible, and it even does backreferences, although if you use those you lose the speed guarantees. And it has a mega-nifty approximate-matching mode!

Unfortunately both implementations are native code (C and C++ respectively) and would require interop to use from .NET.

Eamon Nerbonne
  • That sounds like you may be writing your regexes in an inefficient manner. – Brad Gilbert Jul 27 '09 at 03:46
  • The whole point is that there exist implementations since the 60s (!) for which no regexes are inefficient in this sense. All regular expressions (without backreferences) *can* be evaluated in linear time - I'm looking for an implementation that dumps backreferences and gives me reliable performance instead. – Eamon Nerbonne Jul 27 '09 at 06:39
  • See Regular Expression Matching Can Be Simple And Fast (http://swtch.com/~rsc/regexp/regexp1.html) for an explanation. – Eamon Nerbonne Jul 27 '09 at 06:42

5 Answers

11

First, what you're suggesting is possible, and you certainly know your subject. You also know that the trade-off of forgoing back-referencing implementations is memory. If you control your environment enough, this is likely a reasonable approach.

The only thing I will comment on before continuing is that I would encourage you to question the choice of using regexes at all. You are clearly more familiar with your specific problem and what you're trying to solve, so only you can answer that. I don't think ANTLR would be a good alternative; however, a home-brew rules engine (if limited in scope) can be highly tuned to your specific needs. It all depends on your specific problem.

For those reading this and 'missing the point', here is some background reading: Regular Expression Matching Can Be Simple And Fast (http://swtch.com/~rsc/regexp/regexp1.html).

From the same site, a number of implementations are linked.

The gist of the entire discussion in the above article is that the best answer is to use both approaches. To that end, the only widely used implementation I'm aware of is the one used by the Tcl language. As I understand it, it was originally written by Henry Spencer and it employs this hybrid approach. There have been a few attempts at porting it to a C library, though I'm not aware of any that are in wide use. Walter Waldo's and Thomas Lackner's are both mentioned and linked from that page. Also mentioned is the Boost library, though I'm not sure about its implementation. You can also look at the Tcl code itself (linked from their site) and work from there.

In short, I'd go with TRE or Plan 9 as these are both actively supported.

Obviously none of these are C#/.NET, and I'm not aware of one that is.

csharptest.net
3

If you can handle using unsafe code (and the licensing issue), you could take the implementation from this TRE Windows port.

You might be able to use this directly with P/Invoke and explicit layout structs for the following:

typedef int regoff_t;
typedef struct {
  size_t re_nsub;  /* Number of parenthesized subexpressions. */
  void *value;     /* For internal use only. */
} regex_t;

typedef struct {
  regoff_t rm_so;
  regoff_t rm_eo;
} regmatch_t;


typedef enum {
  REG_OK = 0,       /* No error. */
  /* POSIX regcomp() return error codes.  (In the order listed in the
     standard.)  */
  REG_NOMATCH,      /* No match. */
  REG_BADPAT,       /* Invalid regexp. */
  REG_ECOLLATE,     /* Unknown collating element. */
  REG_ECTYPE,       /* Unknown character class name. */
  REG_EESCAPE,      /* Trailing backslash. */
  REG_ESUBREG,      /* Invalid back reference. */
  REG_EBRACK,       /* "[]" imbalance */
  REG_EPAREN,       /* "\(\)" or "()" imbalance */
  REG_EBRACE,       /* "\{\}" or "{}" imbalance */
  REG_BADBR,        /* Invalid content of {} */
  REG_ERANGE,       /* Invalid use of range operator */
  REG_ESPACE,       /* Out of memory.  */
  REG_BADRPT            /* Invalid use of repetition operators. */
} reg_errcode_t;

Then use the exports capable of handling strings with embedded nulls (with wide-character support):

/* Versions with a maximum length argument and therefore the capability to
   handle null characters in the middle of the strings (not in POSIX.2). */
int regwncomp(regex_t *preg, const wchar_t *regex, size_t len, int cflags);

int regwnexec(const regex_t *preg, const wchar_t *string, size_t len,
      size_t nmatch, regmatch_t pmatch[], int eflags);
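
For illustration, a minimal P/Invoke sketch against those two exports. The "tre.dll" module name is a placeholder for whatever the port actually builds, and a real wrapper would also free the compiled pattern via the library's regfree export:

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct regex_t
{
    public UIntPtr re_nsub; // size_t: number of parenthesized subexpressions
    public IntPtr value;    // void*: for internal use only
}

[StructLayout(LayoutKind.Sequential)]
struct regmatch_t
{
    public int rm_so; // regoff_t: match start offset
    public int rm_eo; // regoff_t: match end offset
}

static class Tre
{
    // wchar_t is 2 bytes on Windows, so CharSet.Unicode marshals correctly.
    [DllImport("tre.dll", CharSet = CharSet.Unicode)] // placeholder DLL name
    public static extern int regwncomp(ref regex_t preg, string regex,
                                       UIntPtr len, int cflags);

    [DllImport("tre.dll", CharSet = CharSet.Unicode)] // placeholder DLL name
    public static extern int regwnexec(ref regex_t preg, string str,
                                       UIntPtr len, UIntPtr nmatch,
                                       [Out] regmatch_t[] pmatch, int eflags);
}

class Demo
{
    static void Main()
    {
        var re = new regex_t();
        string pattern = "a(b|c)+";
        if (Tre.regwncomp(ref re, pattern, (UIntPtr)pattern.Length, 0) != 0)
            return; // non-zero = compile error (see reg_errcode_t above)

        string input = "xabcc";
        var groups = new regmatch_t[2]; // whole match + one capture group
        int rc = Tre.regwnexec(ref re, input, (UIntPtr)input.Length,
                               (UIntPtr)groups.Length, groups, 0);
        if (rc == 0) // REG_OK
            Console.WriteLine($"match at [{groups[0].rm_so}, {groups[0].rm_eo})");
        // A real wrapper must also free the pattern via the library's
        // regfree export, and should hold regex_t behind a SafeHandle.
    }
}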

Alternatively, wrap it via a C++/CLI solution for easier translation and more flexibility (I would certainly suggest this is sensible if you are comfortable with C++/CLI).

ShuggyCoUk
  • TRE looks good; I'd indeed use it via C++/CLI rather than doing tricky explicit-layout stuff. That's much easier ;-). – Eamon Nerbonne Nov 25 '09 at 09:32
1

Where can I find a robustly fast regex implementation?

You can't.

Someone has to say it: given the restrictions, the answer to this question is surely that you can't - it's unlikely you will find an implementation matching your constraints.

By the way, I am sure you have already tried this, but have you compiled the regex (with the option that outputs it to an assembly)? I ask because of this line:

"if you have a complex Regex and millions of short strings to test"
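
For reference, a quick sketch of both compilation options in System.Text.RegularExpressions. Note this only buys a constant-factor speedup - the engine still backtracks, so it doesn't change the worst-case behavior being fought here:

using System;
using System.Reflection;
using System.Text.RegularExpressions;

class CompiledRegexDemo
{
    static void Main()
    {
        // RegexOptions.Compiled emits the matcher as IL instead of
        // interpreting the pattern: faster matching, slower construction.
        var quick = new Regex(@"\bfoo\w*\b", RegexOptions.Compiled);
        Console.WriteLine(quick.IsMatch("some foobar here")); // True

        // Regex.CompileToAssembly (classic .NET Framework only) writes the
        // compiled matcher to an assembly on disk, so the construction
        // cost is paid once, offline, instead of at every startup.
        var info = new RegexCompilationInfo(@"\bfoo\w*\b", RegexOptions.None,
                                            "FooRegex", "MyRegexes", true);
        Regex.CompileToAssembly(new[] { info }, new AssemblyName("MyRegexes"));
    }
}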

eglasius
  • I'm hoping this answer is wrong. Certainly most parser generators must contain such a device under the covers; all I'm looking for is one with a reasonably accessible API from .NET. The theory is there, and so is the technology; I just can't find it wrapped up in a handy way... – Eamon Nerbonne Nov 23 '09 at 16:23
  • I have tried compiling large regexes with the default implementation; it doesn't seem to terminate. Reducing dictionary size until it becomes handleable isn't handy, and the resulting regexes aren't as fast as they should be. – Eamon Nerbonne Nov 23 '09 at 16:26
  • Just to be clear - I am not saying it can't be done, just that it's very likely there isn't an already-implemented one. Most devs use the default, and that's enough for their purposes. Those who don't won't necessarily have had the same set of constraints/characteristics that you need - and surely they evaluated the trade-off in their given scenarios. – eglasius Nov 23 '09 at 16:43
  • I know "can't" is a strong word... only time will tell. By the way - can you provide more info on 'large regexes'? – eglasius Nov 23 '09 at 16:48
  • A large regex might literally be a dictionary converted to a regex. Or it might be a tokenizer for some programming language. If you start making nested regexes - along the lines of: a value is regex FOO, and a delimited value expression is, e.g., regex (FOO)(;(FOO))+ - then regexes can also get quite large pretty quickly. Basically, I'd like to treat regexes as first-class citizens where you can take the union, the intersection, the inverse, the concatenation, and just generally do the obvious things that are possible with regular languages but for some obscure reason not possible as-is. – Eamon Nerbonne Nov 23 '09 at 18:16
  • I might be wrong, but the dictionary scenario seems like something that would get very good performance using non-regex approaches (as each letter discards a huge number of the words involved). Consider whether you should really be treating everything as a regex problem. – eglasius Nov 23 '09 at 18:39
  • I know I _can_ solve the dictionary approach differently, and I've done that, but sometimes I then want a dictionary and some patterns. Or a dictionary minus some patterns. So you end up writing trie processing and graph structures and whatnot, and it's kinda reinventing the wheel: regexes already do all that (in principle), the default implementation just does it poorly (slowly) and fails to expose an API to do it programmatically. – Eamon Nerbonne Nov 24 '09 at 16:11
0

Consider how DFAs are created from regular expressions:

You start with a regular expression. Each operation (concatenation, union, Kleene closure) becomes a small cluster of states and transitions in an NFA. The resulting DFA's states represent subsets (members of the power set) of the NFA's states. The number of NFA states is linear in the size of the regular expression, and therefore the number of DFA states is, in the worst case, exponential in the size of the regular expression.

So your first constraint,

have a worst case time-complexity of regex evaluation of O(m*n) where m is the length of the regex, and n the length of the input

is impossible. The regex needs to be compiled to a DFA with up to 2^m states (worst case), which cannot be done in linear time.

This is the case for all but the simplest regular expressions - ones so simple that you could more easily write a quick .Contains call instead.

Welbog
  • Read what you quoted - "...time-complexity of regex *evaluation*..." – Draemon Jul 24 '09 at 14:56
  • For a DFA it's not possible, but for an NFA it is: you simply don't compile the NFA into a DFA, but perform direct simulation on the NFA. In your simulation, your current "DFA state" is represented by an arbitrary combination of NFA states. There are just O(m) NFA states, so each step may involve all of them, hence the overall running time of O(m*n) (or was it O(m^2*n)? anyhow, not exponential). A DFA would have O(n) time but a potentially exponential number of states, as you say. (A concrete sketch follows these comments.) – Eamon Nerbonne Jul 24 '09 at 14:56
  • How are you going to simulate the NFA? You'll need to fork a lot of processes, use distributed computing or run each nondeterministic branch in sequence. It'll take a lot more resources to do that, but if resources aren't a concern then yeah, I suppose it would be faster. Will you be able to find a library that does this? Probably not. They're usually constrained by resources. – Welbog Jul 24 '09 at 15:09
  • I added a link about NFA "simulation" - the term sounds more complex than the actual implementation, in which after each "step" you simply have all NFA states flagged with a boolean meaning "could I be here?". This is trivial to do in a small non-parallel implementation. – Eamon Nerbonne Jul 24 '09 at 15:14
  • Ah, fair enough. It's basically like the NFA-to-DFA conversion algorithm, but run with actual input. I like it. – Welbog Jul 24 '09 at 15:27
  • Yeah, great eh? The first time I saw that, I thought - damn, it's so simple ;-). – Eamon Nerbonne Jul 24 '09 at 16:36
  • After a bit more reading, apparently you can even do the DFA "conversion" in less than 2^m time (and 2^m is itself a worst case that most regexes never come close to hitting). The trick is to do it lazily, as a side effect of NFA simulation: you cache every reached superposition of NFA states as a DFA state. In short, you can probably have the best of the DFA and NFA worlds at the expense of a bit more memory than the NFA requires (but never more than O(min(n, #dfa-states)) memory). – Eamon Nerbonne Nov 27 '09 at 12:19
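
To make the simulation discussed in these comments concrete, here is a minimal sketch. The NFA is hand-built for the toy pattern (a|b)*abb rather than constructed from a parsed regex, and epsilon transitions are omitted since this particular automaton doesn't need them:

using System;
using System.Collections.Generic;

// Linear-time NFA simulation: instead of converting the NFA to a
// (potentially exponential) DFA up front, track the set of states the
// NFA "could be in" and advance the whole set one character at a time.
// Each step touches at most m states, so matching costs O(m*n) overall.
class NfaSimulation
{
    // Hand-built NFA for (a|b)*abb: states 0..3, start 0, accept 3.
    static readonly Dictionary<(int State, char C), int[]> Trans =
        new Dictionary<(int, char), int[]>
    {
        [(0, 'a')] = new[] { 0, 1 }, // keep looping, or start matching "abb"
        [(0, 'b')] = new[] { 0 },
        [(1, 'b')] = new[] { 2 },
        [(2, 'b')] = new[] { 3 },
    };

    static bool Matches(string input)
    {
        var current = new HashSet<int> { 0 };
        foreach (char c in input)
        {
            var next = new HashSet<int>();
            foreach (int s in current)
                if (Trans.TryGetValue((s, c), out var targets))
                    next.UnionWith(targets);
            if (next.Count == 0) return false; // no live states left
            current = next;
        }
        return current.Contains(3); // did any path reach the accept state?
    }

    static void Main()
    {
        Console.WriteLine(Matches("abababb")); // True
        Console.WriteLine(Matches("abb"));     // True
        Console.WriteLine(Matches("abba"));    // False
    }
}

The lazy-DFA trick from the last comment is then just memoization of the current sets: the first time a given set arises, compute and cache its outgoing transitions, and thereafter it behaves exactly like a DFA state.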
0

A quick comment: just because you avoid up-front DFA construction by simulating with multiple states does not mean you are not doing the work of the NFA-to-DFA conversion. The difference is that you distribute the effort over the search itself; i.e., worst-case performance is unchanged.

Frank
  • I'm not sure *exactly* what you're getting at, but some NFA-to-DFA conversions are exponential, whereas no NFA simulation ever is: for an input of length n, you'll never trigger more than n state superpositions. So, for the evaluation of a bounded number of bounded-length strings, the simulation indeed has better worst-case performance, simply by never visiting the unnecessary states that the NFA-to-DFA converter does. – Eamon Nerbonne Jan 11 '10 at 09:05