A more elegant way to parse a string with ruby regular expression using variable grouping?

Question

At the moment I have a regular expression that looks like this:

^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$

It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.

Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?

^(cat|dog|bird)+$

works, but only returns the last match separately, because there is only one group.
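A minimal sketch of that behaviour, assuming the sample string "catdogbird" (not from the original question) and using \A and \z in place of ^ and $:

```ruby
# Repeating a single capture group: only the final repetition is kept.
m = /\A(cat|dog|bird)+\z/.match("catdogbird")

whole_match  = m[0]  # the full text that the pattern matched
last_capture = m[1]  # the group's last repetition only

puts whole_match
puts last_capture
```

The earlier repetitions ("cat" and "dog" here) are not retained in the MatchData.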

Yes. False positives. The match can only be true if the entire string is swallowed up by the regexp. The actual list of words being used is very large, I have cut it down to make my question easier to understand. — i0n, Dec 01 '11 at 17:09

i0n · Accepted Answer · 2011-12-02T21:13:19.150

OK, so I found a solution to this.

It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.

I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA-based regexp engines (like the one used in Ruby) try alternatives in sequence, as well as being lazy/greedy. This means that you can dictate how a pattern is matched by the order in which you give it choices. This explains why scan was returning variable results: it took the first word in the list that matched at each position and then moved on to the next match. By design it was looking not for the longest match but for the first one. So to rectify this, all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order to length order (longest to shortest).

array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length } 
regexp = Regexp.union(list)
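To illustrate why the ordering matters, here is a small comparison sketch using the same four words. The variable names are only for illustration.

```ruby
words = %w[as ascarid car id]

# Same words, two different alternation orders.
alphabetical  = Regexp.union(words.sort)
longest_first = Regexp.union(words.sort_by { |w| -w.length })

# Alternatives are tried left to right, so "as" wins before "ascarid" is tried.
alphabetical_pieces = "ascarid".scan(alphabetical)

# With the longest word listed first, "ascarid" is tried first.
longest_first_pieces = "ascarid".scan(longest_first)

p alphabetical_pieces
p longest_first_pieces
```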

Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:

word = "ascarid"
# true only if the pieces found by scan account for every character of word
word.scan(regexp).join.length == word.length
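The check above could be wrapped in a method. This is only a sketch: made_of_words? is a hypothetical name, and the extra test strings are assumptions for illustration.

```ruby
# Hypothetical helper: true when longest-first scanning consumes all of str.
def made_of_words?(str, words)
  regexp = Regexp.union(words.sort_by { |w| -w.length })
  # scan skips characters it cannot match, so the joined pieces are
  # shorter than str whenever something was left over.
  str.scan(regexp).join.length == str.length
end

words = %w[as ascarid car id]

puts made_of_words?("ascarid", words)   # one piece: "ascarid"
puts made_of_words?("carid", words)     # two pieces: "car", "id"
puts made_of_words?("ascaridx", words)  # the trailing "x" is never matched
```

One caveat with this longest-first approach: scan does not backtrack across matches, so a string can be rejected even though another split exists. With the words ab, abc and cd, the string "abcd" is scanned as "abc" followed by an unmatched "d", although "ab" plus "cd" would cover it.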

Thanks to everyone that posted in response to this question, I hope that this will help others in the future.

Yeah, I was looking for a guarantee that `/a|aa/` would match left to right, nice to have extra confirmation. You could `array.sort_by {|word| -word.length }` if you wanted one step too. — mu is too short, Dec 02 '11 at 20:14
BTW, this turned out to be a more interesting problem than it first appeared to be, nice one. — mu is too short, Dec 03 '11 at 18:27

mu is too short · Answer 2 · 2019-03-18T16:32:42.943

You could do it in two steps:

1. Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
2. Then string.scan(/cat|dog|bird/) to get the pieces.
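The two steps can be sketched as follows. The method name pieces_of and the sample strings are assumptions for illustration.

```ruby
# Hypothetical helper: validate the whole string first, then collect the pieces.
def pieces_of(string)
  # Step 1: the entire string must consist of the allowed words.
  return nil unless string =~ /\A(cat|dog|bird)+\z/

  # Step 2: scan without groups returns each matched word in order.
  string.scan(/cat|dog|bird/)
end

p pieces_of("birddogcatbird")
p pieces_of("birdcow")
```

Returning nil for a string that fails validation is just one possible convention; raising an error would work equally well.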

You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:

require 'set'
words = Set.new(a)
re    = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if(parts.any? {|w| !words.include?(w) })
  # 's' didn't match what you expected so throw a
  # hissy fit, format the hard drive, set fire to
  # the backups, or whatever is appropriate.
else
  # Everything you were looking for is in 'parts'
  # so you can check the length (if you care about
  # how many matches there were) or something useful
  # and productive.
end

When you use split with a pattern that contains groups, then the respective matches will be returned in the array as well.

In this case, the split will hand us something like ["", "cat", "", "dog"]; the empty strings will only occur between the separators that we're looking for, so we can reject them and pretend they don't exist. This may be an unexpected use of split, since we're more interested in the delimiters than in what is being delimited (except to make sure that nothing is being delimited), but it gets the job done.
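A small sketch of what split returns with a grouped pattern, using a shortened word list and two sample strings chosen for illustration:

```ruby
require 'set'

words = Set.new(%w[cat dog bird])
re    = /(cat|dog|bird)/

# Because the pattern has a group, the delimiters are kept in the result.
raw_good = "catdog".split(re)
raw_bad  = "catXdog".split(re)

good_parts = raw_good.reject(&:empty?)
bad_parts  = raw_bad.reject(&:empty?)

# Anything left that is not a known word was text between the delimiters.
good_ok = good_parts.all? { |w| words.include?(w) }
bad_ok  = bad_parts.all? { |w| words.include?(w) }

p raw_good, raw_bad
p good_ok, bad_ok
```

The stray "X" survives the reject step, which is what lets the Set lookup flag the second string as invalid.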

Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:

>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan

So you could sort your words from longest to shortest when building your regex:

re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/

and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.

Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:

if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
  # Bail out and complain that 's' doesn't look right
end

Then group your words by length:

by_length = a.group_by(&:length)

and scan for the groups from the longest words to the shortest words:

# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
  re = /(#{by_length[group].map{|w| Regexp.quote(w)}.join('|')})/
  s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'

There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.

Maybe I should have mentioned, but I have already tried using split. The problem is that split will return the first piece that it matches, so with a large array of words in the regexp there are many false positives and mismatches using scan. I cut the word list down for the example because it would confuse the issue. Grouping is the only way that I can find that achieves what I need without many errors, but I have to specify the number of groups up front. Is there a way to do this dynamically? That is the crux of the question, not achieving matching. — i0n, Dec 01 '11 at 17:18
@i0n: So some of the "words" overlap each other and you want the longest ones to match before looking at the shorter ones? Is this actually a biology problem by chance? — mu is too short, Dec 01 '11 at 18:50
Yes that's right. So for instance, the word "ascarid" ideally would match one word, the word "ascarid". At present it would match as 3 words: "as" "car" "id". I need the pattern to be greedy but always match the entire string if it is possible! — i0n, Dec 02 '11 at 15:53
@i0n: I've added an update with some possibilities (they were too big for a comment). — mu is too short, Dec 02 '11 at 19:27
Looks like we have arrived at the same conclusion. Thanks for your help! — i0n, Dec 02 '11 at 19:59

score 1 · Answer 3 · answered Nov 30 '11 at 17:29

If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:

r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
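As a sketch of how this looks in use (the \A and \z anchors and the sample string are additions for illustration, not part of the answer above), optional groups that did not take part in the match come back as nil:

```ruby
r = "(cat|dog|bird)"

# The string r is interpolated as regex source, giving three groups.
pattern = /\A#{r}#{r}?#{r}?\z/

m = pattern.match("catdog")
all_groups   = m.captures          # includes nil for the unused third group
matched_only = m.captures.compact  # drops the nil entries

p all_groups
p matched_only
```

The number of groups is still fixed when the pattern is built, so a string made of four words does not match this pattern.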

Andrew Clark


I am already storing the array of words (which is actually much longer than in the example) in a variable, I just removed this from the example to avoid clouding the issue. – i0n Dec 01 '11 at 17:03

score 1 · Answer 4 · answered Dec 02 '11 at 19:55

You can do it with .NET regular expressions. If I write the following in PowerShell

$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}

then I get

bird
dog
cat
bird

when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.
