19

I have this regex:

^(^?)*\?(.*)$

If I understand correctly, this is the breakdown of what it does:

  • ^ - start matching from the beginning of the string
  • (^?)* - I don't know, but it stores it in $1
  • \? - matches a question mark
  • (.*)$ - matches anything until the end of the string

So what does (^?)* mean?

AD7six
doremi

4 Answers

22

The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those two positions, the ^ is interpreted literally, meaning it looks for the ^ character in the input string.

Note: Whether or not ^ outside of the first and grouping positions is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.

JaredPar
  • Hmm. I still don't get it. Can you give me an example of a string where this would match? FYI - this is being used on a url with a query string. – doremi Mar 04 '13 at 16:18
  • no comment on the pointlessness of `(^?)*`? i.e. it's a 0 or one character match, matching only the character `^`, matching 0 to many times - the same (probably) as `(^*)` unless the multiple groups are being used – AD7six Mar 04 '13 at 16:18
  • It could be a bad regex, as it was provided to me by someone else. That's part of the reason why I'm trying to understand what it does. – doremi Mar 04 '13 at 16:19
  • @AD7six agreed that's most likely bogus. It *could* be valid in certain regex engines (Vim with no magic for example) but that setting would also invalidate my answer. Very likely bogus but wanted to know the regex engine before I dove outside the specifics of the question – JaredPar Mar 04 '13 at 16:23
  • @JaredPar: Your example is very confusing. .NET is a pretty bad example, since it has special meaning in every single case you have there. It may be true that Lua treats `^` as a literal character, but let me double check. – nhahtdh Mar 04 '13 at 18:25
  • @nhahtdh there are 5 occurrences of `^` in my patterns, only 3 have special meaning. – JaredPar Mar 04 '13 at 19:03
  • @JaredPar: All of them have special meaning. I have checked it on regexhero (.NET tester). I think you should use `Match` instead of `IsMatch` to check in detail what is actually matched: http://ideone.com/T6Jyxu – nhahtdh Mar 04 '13 at 19:10
  • @nhahtdh don't believe that's correct. The second `^` is interpreted literally. I've tested this locally – JaredPar Mar 04 '13 at 19:16
  • @JaredPar: Maybe with Lua, but not with .NET. Have you tried my code on ideone? – nhahtdh Mar 04 '13 at 19:17
7

Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.

Interpreted as a Lua pattern, the example will surprise a longtime regexp user, since so many details are different.

Lua patterns are described in PiL, and at first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably these: there is no alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \?, which smells like a conventional regexp to match a single literal ?.
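As a small illustration of the escaping difference, here is a minimal sketch using only the standard string library. The sample strings ("key?value" and a\b) are made up for the demonstration and do not come from the question.

```lua
-- Hypothetical sample strings; only standard string functions are used.

-- "%" escapes a magic character, so "%?" matches a literal question mark.
local key, value = string.match("key?value", "^(%w+)%?(%w+)$")
print(key, value)

-- A backslash has no special meaning inside a pattern; it matches itself.
-- The long-string form [[...]] keeps the string-literal parser from
-- treating the backslash as the start of an escape sequence.
local found = string.match([[a\b]], [[a\b]])
print(found)
```

The first call captures the text on either side of the question mark; the second shows that a backslash in a pattern simply matches a backslash in the subject.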

The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
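A quick way to check that the two spellings behave alike is to run both against a few inputs. This is only a sketch; the test strings are arbitrary examples, and the variable names terse and explicit are chosen here for illustration.

```lua
local terse = "(^?)*"       -- the form from the question
local explicit = "(%^?)%*"  -- the same thing with the literals escaped

-- Arbitrary sample inputs: two that contain an asterisk, two that do not.
for _, s in ipairs({ "^*", "*", "^", "abc" }) do
  local a = string.match(s, terse)
  local b = string.match(s, explicit)
  assert(a == b)  -- both spellings give the same capture (or both fail)
  print(string.format("%q -> %s", s, a and string.format("%q", a) or "no match"))
end
```

Both patterns need a literal * in the subject, so "^" and "abc" do not match, while "^*" and "*" do, with the capture recording whether the caret was present.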

To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:

^(^?)*\?(.*)$

Handed to string.match(), it would be interpreted as follows:

^ anchors the match to the beginning of the string.

( marks the beginning of the first capture.

^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.

? matches exactly zero or one of the previous character.

) marks the end of the first capture.

* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.

\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, where it makes the following character not special to the string literal parser. Here that is moot, because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, the \ would be absorbed by string parsing (in Lua 5.1; Lua 5.2 and later reject \? as an invalid escape sequence). If written as a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser and appear in the pattern.

? matches exactly zero or one of the previous character.

( marks the beginning of the second capture.

. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).

* matches zero or more of the previous character, greedily.

) marks the end of the second capture.

$ anchors the pattern to the end of the string.

So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
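The walkthrough can be checked with a short script. This sketch assumes the pattern is written as a long string so that the backslash reaches the pattern matcher; the helper name try and the sample subjects (including the example.com URL) are hypothetical.

```lua
-- Long-string form, so the backslash is part of the pattern itself.
local pattern = [[^(^?)*\?(.*)$]]

-- Hypothetical helper: returns the two captures, or nil on failure.
local function try(subject)
  return string.match(subject, pattern)
end

-- Caret, asterisk, backslash, then text: captures "^" and "abc".
print(try([[^*\abc]]))

-- No caret and no backslash: captures "" and "abc".
print(try("*abc"))

-- A URL with a query string does not start with an optional ^ followed by *,
-- so the match fails and nil is returned.
print(try("http://example.com/page?x=1"))
```

The last case is the relevant one for the question: read as a Lua pattern, the expression does not split a URL at its question mark.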

Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.

Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.

RBerteig
  • Actually, the first `*` is not an error, it is just magic-less. For example, `assert(string.match("^*", "^(^?)*"))` – Egor Skriptunoff Dec 13 '16 at 07:52
  • @EgorSkriptunoff I think you are right. The effect is much the same, the pattern doesn't match what a regex user thinks it would match. – RBerteig Dec 14 '16 at 02:07
2

In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.

For all your Regex needs: http://regexlib.com/CheatSheet.aspx

Tui Popenoe
1

It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

adam0101