10

Over the years, "regex" pattern matching has been getting more and more powerful to the point where I wonder: is it really just context-sensitive-grammar matching? Is it a variation/extension of context-free-grammar matching? Where is it right now and why don't we just call it that instead of the old, restrictive "regular expression"?

notnot

3 Answers

12

In particular, backreferences to capturing parentheses make regular expressions more complex than regular, context-free, or context-sensitive grammars. The name has simply grown historically, as many words have. See also this section in Wikipedia and this explanation with an example from Perl.
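
To make the backreference point concrete, here is a minimal sketch using Python's standard `re` module (Python is my choice here; the linked example uses Perl). The pattern `([ab]+)\1` accepts exactly the strings of the form ww, a language that is known not to be context-free, let alone regular. The helper name `is_square` is invented for this illustration.

```python
import re

# ([ab]+) captures a non-empty word w; \1 then demands the same text again.
SQUARE = re.compile(r"([ab]+)\1")


def is_square(s: str) -> bool:
    """Return True if s has the form ww for some non-empty w over {a, b}."""
    return SQUARE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["abab", "aa", "abba", "aba"]:
        print(candidate, is_square(candidate))
```

No finite-state machine can do this, because the captured word w can be arbitrarily long and has to be compared character by character with what follows.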

Fabian Steeg
  • Could you please explain the difference between `regular language` and `regular expression`? – Christian Klauser Mar 04 '09 at 22:18
  • Is it really more powerful than CSG? Could you give an example? – notnot Mar 04 '09 at 22:22
  • A regular language can be described by a regular grammar (see http://en.wikipedia.org/wiki/Regular_grammar), while regular expressions are a pattern matching language that is less restricted and therefore more complex to process. – Fabian Steeg Mar 04 '09 at 22:25
  • Thanks for your comment notnot, I've added a link to a sample and some details. – Fabian Steeg Mar 04 '09 at 22:41
  • Hmmm... does this mean that we can match against arbitrary CSG's using today's tools? – notnot Mar 04 '09 at 22:51
  • Hm, not really sure what you are asking but I believe no, it simply means some regular expressions can't even be described with a CSG and so can't be parsed by a linear bounded automaton. – Fabian Steeg Mar 04 '09 at 23:02
  • Where are the boundary lines, then, on our regex tools? How can we write expressions that defy the limits of CSG's and yet not be able to express all CSG's? – notnot Mar 04 '09 at 23:14
  • I don't know, I just meant to say that the fact that some things work that cannot be expressed by a CFG does not necessarily imply that all things that can be will work too. – Fabian Steeg Mar 04 '09 at 23:50
  • Hey, I like the idea of defining "regex" as distinct from "regular expressions", and not just a contraction. Problem is, it still looks like a contraction. – 13ren Mar 05 '09 at 00:42
  • @notnot, not all sets/pairs have a strict order (which is a subset of which out of `{1,2,3}` and `{1,2,4}` - the answer is neither). Similarly it is quite easy to show that (for e.g.) Parsing Expression Grammars can match some languages that cannot be matched by CFGs (e.g. `a^nb^nc^n` is not context free but can be parsed by a PEG) and simultaneously there is not a PEG equivalent for all CFGs. – tobyodavies Feb 27 '11 at 13:51
5

The way I see it:

  • Regular languages:
    • Matched by state machines. Only one variable can be used to represent the current "location" in the grammar to be matched: Recursion cannot be implemented
  • Context-free languages:
    • Matched by a stack machine. The current "location" in the grammar is represented by a stack in one or another form. Cannot "remember" anything that occurred before
  • Context-sensitive languages:
    • Most programming languages
    • Most human languages
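
A rough sketch of the first two levels, in Python (my choice of language, with made-up function names): the first recognizer keeps its entire "location" in a single variable, the second needs a stack to track nesting.

```python
def matches_ab_star(s: str) -> bool:
    """Finite-state recognizer for (ab)*: one variable holds the whole state."""
    state = 0  # 0 = expecting 'a' (accepting), 1 = expecting 'b'
    for ch in s:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == "b":
            state = 0
        else:
            return False  # no transition defined for this input: reject
    return state == 0


# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "["}


def matches_nested(s: str) -> bool:
    """Stack-based recognizer for properly nested () and [] brackets."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)  # remember which bracket is still open
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False  # closing bracket without a matching opener
        else:
            return False  # character outside the alphabet
    return not stack  # every opened bracket must have been closed
```

With two bracket types a plain counter is not enough; the stack has to remember which kind of bracket was opened at each depth.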

I do know of regular expression parsers that allow you to match against something the parser has already encountered, achieving something like a context-sensitive grammar.

Still, regular expression parsers, however sophisticated they may be, don't allow for recursive application of rules, which is a definite requirement for context-free grammars.
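
For comparison, this is roughly what "recursive application of rules" looks like in a hand-written recognizer. The sketch below, in Python with invented names, applies the single context-free rule S → 0 S 1 | ε by having the function for S call itself.

```python
from typing import Optional


def parse_s(s: str, pos: int = 0) -> Optional[int]:
    """Apply S -> '0' S '1' | empty at pos.

    Returns the position just after the text matched by S,
    or None if the rule cannot be applied here.
    """
    if pos < len(s) and s[pos] == "0":
        inner_end = parse_s(s, pos + 1)  # the rule S applied recursively
        if inner_end is not None and inner_end < len(s) and s[inner_end] == "1":
            return inner_end + 1
        return None
    return pos  # empty alternative: S matches nothing at this position


def is_0n1n(s: str) -> bool:
    """True if the whole string is derived from S, i.e. n zeros then n ones."""
    return parse_s(s) == len(s)
```

The call stack does the bookkeeping here, so very long inputs would run into Python's recursion limit; an explicit stack or a counter avoids that.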

The term regex, in my opinion, mostly refers to the syntax used to express those regular grammars (the stars and question marks).

Christian Klauser
  • Lookahead/lookbehind and naming definitely add something that sits outside of standard regular expressions - memory. So aren't we at PDA level? – notnot Mar 04 '09 at 22:28
  • It's not in general true that natural language is context-sensitive, see http://www.eecs.harvard.edu/~shieber/Biblio/Papers/shieber85.pdf – Fabian Steeg Mar 04 '09 at 22:54
4

There are features in modern regular expression implementations that break the rules of the classic regular expression definition.

For example, Microsoft’s .NET balancing groups, `(?<name1-name2> … )`:

^(?:0(?<L>)|1(?<-L>))*(?(L)(?!))$

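This matches every string in the language L₀₁ = {ε, 01, 0011, 000111, … }, as well as interleaved balanced strings such as 0101, since each 1 only has to close an earlier 0. Either way, the matched language is not regular, as the pumping lemma shows.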
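
Python's `re` module has no balancing groups, so the following is only an emulation of what I understand the pattern to do, with the group `L` modelled as an explicit stack. The function name `balanced_01` is made up for this sketch.

```python
def balanced_01(s: str) -> bool:
    """Emulate ^(?:0(?<L>)|1(?<-L>))*(?(L)(?!))$ with an explicit stack."""
    captures = []  # plays the role of the capture stack of group L
    for index, ch in enumerate(s):
        if ch == "0":
            captures.append(index)  # like (?<L>): push a capture
        elif ch == "1":
            if not captures:
                return False  # like (?<-L>) with nothing to pop: no match
            captures.pop()
        else:
            return False  # neither alternative accepts other characters
    return not captures  # like (?(L)(?!)): fail if any capture is left
```

Under this reading the pattern accepts `0011` and also `0101`, and rejects `10` and `001`. Since the captures are never inspected, only counted, a single integer counter would behave the same way.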

Gumbo
  • I know that it goes beyond classic regex, but I'm wondering how much further. Fabian's link above is interesting. – notnot Mar 04 '09 at 22:49