Working with Regular Expressions - Repeating Patterns

Question

I am trying to use regular expressions to match some text.

The following pattern is what I am trying to gather.

@Identifier('VariableA', 'VariableB', 'VariableX', ..., 'VariableZ')

I would like to grab a dynamic number of variables rather than a fixed set of two or three. Is there any way to do this? I have an existing Regular Expression:

\@(\w+)\W+(\w+)\W+(\w+)\W+(\w+)

This captures the Identifier and up to three variables.

Edit: Is it just me, or are regular expressions not as powerful as I'm making them out to be?

mu is too short · Accepted Answer · 2011-10-28T03:09:22.593

You want to use scan for this sort of thing. The basic pattern would be this:

s.scan(/\w+/)

That would give you an array of all the contiguous sequences of word characters:

>> "@Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')".scan(/\w+/)
=> ["Identifier", "VariableA", "VariableB", "VariableX", "VariableZ"]

You say you might have multiple instances of your pattern with arbitrary stuff surrounding them. You can deal with that with nested scans:

s.scan(/@(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }

That will give you an array of arrays; each inner array will have the "Identifier" part as the first element and the "Variable" parts as an array in the second element. For example:

>> s = "pancakes @Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ') pancakes @Pancakes('one','two','three') eggs"
>> s.scan(/@(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
=> [["Identifier", ["VariableA", "VariableB", "VariableX", "VariableZ"]], ["Pancakes", ["one", "two", "three"]]]

If you might be facing escaped quotes inside your "Variable" bits then you'll need something more complex.
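As a rough sketch of what "something more complex" could look like, the inner scan can be swapped for a pattern that understands backslash escapes. The input string and the names inner, quoted and values are made up for illustration, and this assumes quotes are escaped with a backslash; it does not touch the outer [^)]+? part, which would still trip over a ")" inside a value.

```ruby
# Hypothetical input: the first value contains a backslash-escaped quote.
inner = "'it\\'s', 'plain'"

# A quoted value is an opening quote, then any run of either an escaped
# character (\\.) or a character that is neither a quote nor a backslash,
# then the closing quote.
quoted = /'((?:\\.|[^'\\])*)'/

# scan returns one single-element array per match; flatten unwraps them.
values = inner.scan(quoted).flatten
puts values
```

The backslash is left in the captured text; stripping it is a separate step if you need the unescaped value.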

Some notes on the expression:

@            # A literal "@".
(            # Open a group.
  \w+        # One or more ("+") word characters ("\w").
)            # Close the group.
\(           # A literal "("; parentheses are used for grouping, so we escape it.
(            # Open a group.
  [          # Open a character class.
    ^)       # The "^" at the beginning of a [] means "not", the ")" isn't escaped because it doesn't have any special meaning inside a character class.
  ]          # Close a character class.
  +?         # One or more of the preceding pattern, but don't be greedy.
)            # Close the group.
\)           # A literal ")".
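Ruby can take an annotated layout like this directly if the regex is written with the x (extended) flag, which ignores whitespace and "#" comments in the pattern. A minimal sketch; the variable names pattern, s and result are only for illustration, and the character class is kept on one line to avoid depending on how whitespace inside [] is treated:

```ruby
pattern = /
  @          # a literal "@"
  (\w+)      # group 1: the identifier
  \(         # a literal "("
  ([^)]+?)   # group 2: everything up to the closing ")"
  \)         # a literal ")"
/x

s = "pancakes @Pancakes('one','two','three') eggs"

# Same nested scan as before: the outer scan finds each @Name(...) block,
# the inner scan pulls the words out of the argument list.
result = s.scan(pattern).map { |name, args| [name, args.scan(/\w+/)] }
p result
```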

You don't really need [^)]+? here; just [^)]+ would do, but I use the non-greedy forms by habit because that's usually what I mean. The grouping is used to separate the @Identifier and Variable parts so that we can easily get the desired nested array output.
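A small sketch of why greediness makes no difference with [^)] but does with a plain dot. The sample string and variable names are invented for the comparison:

```ruby
s = "@A('x') and @B('y')"

# The negated class can never run past a ")", so both forms stop at the
# first closing parenthesis.
class_greedy = s.scan(/@\w+\(([^)]+)\)/)
class_lazy   = s.scan(/@\w+\(([^)]+?)\)/)

# A dot can match ")", so the greedy form runs on to the last ")" in the
# string, while the non-greedy form stops at the first one.
dot_greedy = s.scan(/@\w+\((.+)\)/)
dot_lazy   = s.scan(/@\w+\((.+?)\)/)

p class_greedy, class_lazy, dot_greedy, dot_lazy
```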

This is perfect! Exactly the solution I was looking for. Now to learn how you created that Regular Expression! Thank you so much! — Michael, Oct 28 '11 at 02:32
You're the best! I really appreciate the help on this. I'm really trying to become more fluent with Ruby as well as Regular Expressions. I didn't even know you could do a logical NOT inside an expression like that. Seriously, thanks again! — Michael, Oct 28 '11 at 06:10
@Michael: Cheers. You can negate within a character class but not outside of one (in general). These might be of use for regexes: [Oniguruma (1.9 regex engine)](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), [Perl's regex docs](http://perldoc.perl.org/perlre.html). The Ruby and Perl regex syntax isn't exactly the same but they're pretty close and the Perl docs are quite good. — mu is too short, Oct 28 '11 at 06:17

score 0 · Answer 2 · answered Oct 28 '11 at 01:41

But alex thinks that you meant you wanted to capture the same thing four times. If you want to capture the same pattern, but different things, then you may want to consider two things:

Iteration. In perl, you can say

while ($variable =~ /regex/g) { push @matches, $&; }

The 'g' stands for 'global', and means that each time the match is evaluated, it picks up where the previous match ended and matches the /next/ instance.
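Ruby has no /g modifier; String#scan fills a similar role, and when given a block it yields each successive match in turn. A minimal sketch of the same iteration idea in Ruby, with an invented input string and variable names:

```ruby
text = "@Identifier('VariableA', 'VariableB', 'VariableZ')"
names = []

# With a block, scan hands over one match at a time. Because the pattern
# contains a group, each match arrives as an array of the group values.
text.scan(/'(\w+)'/) { |groups| names << groups.first }

p names
```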

The other option is recursion. Write your regex like this:

/(what you want)(.*)/

Then, you have backreference 1 containing the first thing, which you can push to an array, and backreference 2 which you'll then recurse over until it no longer matches.
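The recursive idea can be sketched in Ruby roughly as follows. The method name collect_words and the use of \w+ for "what you want" are assumptions made for the example:

```ruby
# Group 1 is "what you want", group 2 is the rest of the string.
# Keep the first group, then recurse on the remainder until nothing matches.
def collect_words(text, found = [])
  m = text.match(/(\w+)(.*)/m)
  return found unless m

  collect_words(m[2], found << m[1])
end

words = collect_words("'VariableA', 'VariableB', 'VariableZ'")
p words
```

For long inputs an iterative scan is likely the safer choice, since each match adds a stack frame here.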

answered Oct 28 '11 at 01:41

Dan

Does the global identifier work in Ruby? I don't think it does. – Michael Oct 28 '11 at 01:49
http://stackoverflow.com/questions/2293032/ruby-doesnt-recognize-the-g-flag-for-regex – Dan Oct 28 '11 at 02:02

score 0 · Answer 3 · answered Oct 28 '11 at 01:43

You can simply use (\w+).

Given the input string @Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')

The results would be:

Identifier
VariableA
VariableB
VariableX
VariableZ

This would work for an arbitrary number of variables.
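A quick sketch of how this behaves in Ruby once the input has other text around it (the sample string is made up). It picks up every word, not only the identifier and its variables:

```ruby
s = "pancakes @Identifier('VariableA', 'VariableB') eggs"

# \w+ has no notion of the @Name(...) structure, so the surrounding
# words come back along with the parts we actually want.
words = s.scan(/\w+/)
p words
```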

For future reference, it's easy and fun to play around with regexp ideas on Rubular.

answered Oct 28 '11 at 01:43

zealoushacker

Been playing with Rubular and still can't seem to get this right. I only need the Identifier and the variables. I could extract this data manually by picking through the string one by one, but then what's the point of Regular Expressions. Also your solution is too broad, if I add any other words or word-like data after or before what I need to capture, then I am capturing unnecessary data. – Michael Oct 28 '11 at 01:48

score 0 · Answer 4 · answered Oct 28 '11 at 02:21

So you are asking whether there is a way to capture both the identifier and an arbitrary number of variables. I am afraid that you can only do this with regex engines that support captures. Note that captures and capturing groups are not one and the same thing: a capturing group that matches repeatedly keeps only its last match, whereas an engine that supports captures remembers every match of the group. You want to remember all the "variables", and this can't be done with simple capturing groups.

I am unaware whether Ruby supports this or not, but I am sure that .NET and the new Perl 6 support it.
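To make the limitation concrete, here is a small Ruby sketch with a repeated capturing group. The pattern and sample string are invented for the example; the point is only that a single match keeps the last value the repeated group saw:

```ruby
s = "@Identifier('VariableA', 'VariableB', 'VariableZ')"

# Group 2 sits inside a repeated non-capturing group, so it matches three
# times, but the MatchData only retains the final iteration.
m = s.match(/@(\w+)\((?:'(\w+)',?\s*)+\)/)

identifier = m[1]
last_variable = m[2]
p identifier, last_variable
```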

In your case you could use two regexes. One to capture the identifier e.g. ^\s*@(\w+)

and another one to capture all variables e.g. result = subject.scan(/'[^']+'/)
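Put together in Ruby, the two-regex approach might look like this sketch. The sample string and variable names are assumptions; note that '[^']+' has no group, so scan returns each match with its surrounding quotes, which are trimmed in a separate step:

```ruby
subject = "@Identifier('VariableA', 'VariableB', 'VariableZ')"

# First regex: the identifier. String#[] with a regex and a group index
# returns that group's text, or nil when there is no match.
identifier = subject[/^\s*@(\w+)/, 1]

# Second regex: every quoted chunk, quotes included.
quoted = subject.scan(/'[^']+'/)

# Drop the first and last character (the quotes) from each chunk.
variables = quoted.map { |q| q[1..-2] }

p identifier, quoted, variables
```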
