Is "regex" in modern programming languages really "context sensitive grammar"?

Question

Over the years, "regex" pattern matching has been getting more and more powerful to the point where I wonder: is it really just context-sensitive-grammar matching? Is it a variation/extension of context-free-grammar matching? Where is it right now and why don't we just call it that instead of the old, restrictive "regular expression"?

Fabian Steeg · Accepted Answer · 2009-03-04T22:39:56.613


In particular, backreferences to capturing parentheses make regular expressions more complex than regular, context-free, or context-sensitive grammars. The name has simply grown historically (as many words have). See also this section in Wikipedia and this explanation with an example from Perl.
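To make the backreference point concrete, here is a minimal sketch using Python's standard `re` module. It recognises strings of the form `ww` (some non-empty string written twice), the so-called copy language, which is a textbook example of a language that is not context-free. The helper name `is_square` is made up for illustration, and the sketch says nothing about where backreferences sit relative to context-sensitive grammars.

```python
import re

# The backreference \1 must repeat exactly what group 1 captured,
# so the whole pattern accepts strings of the form ww (w non-empty).
SQUARE = re.compile(r"(.+)\1")


def is_square(s: str) -> bool:
    """Return True if s is some non-empty string written twice in a row."""
    return SQUARE.fullmatch(s) is not None


print(is_square("abcabc"))  # "abc" followed by "abc"
print(is_square("abcabd"))  # second half differs from the first
```

The engine finds the split point by backtracking over the possible lengths of group 1, which is why such patterns can be much slower than classic regular expressions.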

edited Mar 04 '09 at 22:39

answered Mar 04 '09 at 22:15


Could you please explain the difference between `regular language` and `regular expression`? – Christian Klauser Mar 04 '09 at 22:18

Is it really more powerful than CSG? Could you give an example? – notnot Mar 04 '09 at 22:22
A regular language can be described by a regular grammar (see http://en.wikipedia.org/wiki/Regular_grammar), while regular expressions are a pattern matching language that is less restricted and therefore more complex to process. – Fabian Steeg Mar 04 '09 at 22:25
Thanks for your comment notnot, I've added a link to a sample and some details. – Fabian Steeg Mar 04 '09 at 22:41
Hmmm... does this mean that we can match against arbitrary CSGs using today's tools? – notnot Mar 04 '09 at 22:51
Hm, not really sure what you are asking, but I believe no. It simply means some regular expressions can't even be described with a CSG and so can't be parsed by a linear bounded automaton. – Fabian Steeg Mar 04 '09 at 23:02
Where are the boundary lines, then, on our regex tools? How can we write expressions that defy the limits of CSGs and yet not be able to express all CSGs? – notnot Mar 04 '09 at 23:14
I don't know, I just meant to say that the fact that some things work that cannot be expressed by a CFG does not necessarily imply that all things that can be will work too. – Fabian Steeg Mar 04 '09 at 23:50
Hey, I like the idea of defining "regex" as distinct from "regular expressions", and not just a contraction. Problem is, it still looks like a contraction. – 13ren Mar 05 '09 at 00:42
@notnot, not all sets/pairs have a strict order (which is a subset of which out of `{1,2,3}` and `{1,2,4}`? The answer is neither). Similarly, it is quite easy to show that, for example, Parsing Expression Grammars can match some languages that cannot be matched by CFGs (e.g. `a^nb^nc^n` is not context-free but can be parsed by a PEG) and simultaneously there is not a PEG equivalent for all CFGs. – tobyodavies Feb 27 '11 at 13:51

Christian Klauser · Answer 2 · 2009-03-06T19:11:30.967

The way I see it:

Regular languages:
- Matched by state machines. Only one variable can be used to represent the current "location" in the grammar to be matched, so recursion cannot be implemented.
Context-free languages:
- Matched by a stack machine. The current "location" in the grammar is represented by a stack in one form or another. Cannot "remember" anything that occurred before.
Context-sensitive languages:
- Most programming languages
- ~~All~~ Most human languages
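
The first two levels of this list can be sketched in a few lines of Python. This is only an illustration under assumed example languages (`(ab)*` for the regular case, `a^n b^n` for the context-free case); the function names are hypothetical.

```python
def accepts_ab_star(s: str) -> bool:
    """State machine for the regular language (ab)*: the only memory is `state`."""
    transitions = {("q0", "a"): "q1", ("q1", "b"): "q0"}
    state = "q0"
    for ch in s:
        state = transitions.get((state, ch))
        if state is None:  # no transition defined for this input: reject
            return False
    return state == "q0"


def accepts_anbn(s: str) -> bool:
    """Stack machine for a^n b^n: push for every 'a', pop for every 'b'."""
    stack = []
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:  # an 'a' after a 'b' breaks the a...ab...b shape
                return False
            stack.append(ch)
        elif ch == "b":
            seen_b = True
            if not stack:  # more b's than a's so far
                return False
            stack.pop()
        else:
            return False
    return not stack  # every 'a' must have been matched by a 'b'
```

The first function never needs more than its single `state` variable, while the second needs memory that grows with the input, which is the distinction the list draws.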

I do know of regular expression parsers that allow you to match against something the parser has already encountered, achieving something like a context-sensitive grammar.

Still, most regular expression parsers, however sophisticated they may be, don't allow for recursive application of rules, which is a definite requirement for context-free grammars. (PCRE and Perl 5.10+ are exceptions: they offer recursive patterns via `(?R)`.)

The term regex, in my opinion, mostly refers to the syntax used to express those regular grammars (the stars and question marks).

Lookahead/lookbehind and naming definitely add something that sits outside of standard regular expressions - memory. So aren't we at PDA level? — notnot, Mar 04 '09 at 22:28
It's not in general true that natural language is context-sensitive, see http://www.eecs.harvard.edu/~shieber/Biblio/Papers/shieber85.pdf — Fabian Steeg, Mar 04 '09 at 22:54

Gumbo · Answer 3 · 2009-03-05T00:16:25.667


There are features in modern regular expression implementations that break the rules of the classic regular expression definition.

For example, Microsoft’s .NET Balancing Group (?<name1-name2> … ):

^(?:0(?<L>)|1(?<-L>))*(?(L)(?!))$

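This matches every string in the language L₀₁ = {ε, 01, 0011, 000111, … }. Because each 1 only has to pop an earlier 0, it also accepts other balanced strings such as 0101. Either way, the matched language is not regular according to the Pumping Lemma.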
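
Python's built-in `re` module has no balancing groups, but the stack behaviour of the pattern above can be imitated by hand. A minimal sketch, assuming the usual reading of the pattern: each `0` pushes a capture onto group `L`, each `1` pops one (and fails if there is none), and the final conditional requires `L` to be empty. The function name `balanced_01` is hypothetical.

```python
def balanced_01(s: str) -> bool:
    """Imitate ^(?:0(?<L>)|1(?<-L>))*(?(L)(?!))$ with an explicit counter."""
    depth = 0  # number of captures currently held by group L
    for ch in s:
        if ch == "0":
            depth += 1  # 0(?<L>) pushes an (empty) capture
        elif ch == "1":
            if depth == 0:  # 1(?<-L>) fails when there is nothing to pop
                return False
            depth -= 1
        else:
            return False  # only 0 and 1 are allowed by the alternation
    return depth == 0  # (?(L)(?!)) rejects if any capture is left over


for candidate in ["", "01", "0011", "0101", "10", "001"]:
    print(repr(candidate), balanced_01(candidate))
```

A plain counter is enough here because the captures are all empty; only their number matters.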

edited Mar 05 '09 at 00:16

answered Mar 04 '09 at 22:45


I know that it goes beyond classic regex, but I'm wondering how much further. Fabian's link above is interesting. – notnot Mar 04 '09 at 22:49
