
I'm having trouble specifying "the next character should not be from this group of characters" in my regex. I have

TOKENS = [":", ".", "'"]
"01:39\t" =~  /\b0\d[#{Regexp.union(TOKENS)}]\d\d^#{Regexp.union(TOKENS)}/
 #=> nil

Since "\t" is not part of my TOKENS array, I would think the above should match, but it does not. How do I adjust my regex, specifically this part

^#{Regexp.union(TOKENS)}

to say that the character should not be part of this array?

Sagar Pandya
Dave

2 Answers


A ^ only means "not" inside a character class, so you need square brackets around the "not" portion of the regex.

>> TOKENS = [":", ".", "'"]
>> regex = /\b0\d[#{Regexp.union(TOKENS)}]\d\d^#{Regexp.union(TOKENS)}/
>> "01:39\t" =~ regex
#=> nil

However:

>> regex = /\b0\d[#{Regexp.union(TOKENS)}]\d\d[^#{Regexp.union(TOKENS)}]/
# Add brackets                            here^                and here^
>> "01:39\t" =~ regex
#=> 0
moveson

After interpolation, your /\b0\d[#{Regexp.union(TOKENS)}]\d\d^#{Regexp.union(TOKENS)}/ pattern expands to

/(?-mix:\b0\d[(?-mix::|\.|')]\d\d^(?-mix::|\.|'))/
             ^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^

Here, each interpolated regex object becomes a modifier group, (?-mix:...), with the multiline, case-insensitive and free-spacing modes disabled. The last ^ is a start-of-line anchor, and it alone ruins the whole regex: it can never match right after \d\d, so the pattern never matches any string.

It is not enough to wrap #{Regexp.union(TOKENS)} in [...] character class brackets. You would also need to use the .source property to get rid of (?-mix:...), since you do not want to negate m, i, x, etc. Even then, Regexp.union is the wrong tool here: it joins the alternatives with a | char, and inside a character class | is treated as a literal char, so you would also negate pipes.
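To make the point above concrete, here is a small sketch of what the interpolation produces and how the bracket-only version misbehaves. The lowercase `tokens` local stands in for the TOKENS constant, and the variable names and sample strings are only illustrative.

```ruby
tokens = [":", ".", "'"]
union  = Regexp.union(tokens)

# What #{union} interpolates into a regex literal vs. the bare source.
union_to_s   = union.to_s    # includes the (?-mix:...) wrapper
union_source = union.source  # alternation only, still contains "|"

# Bracket-only version: every char of "(?-mix::|\.|')" lands in the class.
naive_rx = /\b0\d[#{union}]\d\d[^#{union}]/

tab_match  = "01:39\t" =~ naive_rx  # matches, as intended
pipe_match = "01:39|"  =~ naive_rx  # nil: "|" is inside the negated class
x_match    = "01:39x"  =~ naive_rx  # nil: "x" from "-mix" is inside the negated class
x_as_sep   = "01x39\t" =~ naive_rx  # matches: "x" is wrongly accepted as a separator

puts union_to_s
puts union_source
p [tab_match, pipe_match, x_match, x_as_sep]
```

So the bracket-only pattern happens to work for the tab input, but it treats letters and symbols from the wrapper as if they were tokens.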

Instead, define the separator sequence with TOKENS.join().gsub(/[\]\[\^\\-]/, '\\\\\\&'), which escapes every char that must be escaped inside a regex character class, and then place the result between character class square brackets.

Ruby demo:

TOKENS = [":", ".", "'", "]"]
sep_rx = TOKENS.join().gsub(/[\]\[\^\\-]/, '\\\\\\&')
puts sep_rx
# => :.'\]
rx = /\b0\d[#{sep_rx}]\d\d[^#{sep_rx}]/
puts rx.source
# => \b0\d[:.'\]]\d\d[^:.'\]]
puts "01:39\t" =~  rx
# => 0
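As a quick sanity check of the demo pattern, the sketch below runs it against a few more inputs. The sample strings and the `results` hash are illustrative additions, and `tokens` is a local stand-in for TOKENS.

```ruby
tokens = [":", ".", "'", "]"]
sep_rx = tokens.join.gsub(/[\]\[\^\\-]/, '\\\\\\&')
rx     = /\b0\d[#{sep_rx}]\d\d[^#{sep_rx}]/

inputs = [
  "01:39\t",  # followed by a tab: not a token
  "01:39|",   # followed by a pipe: not a token
  "01:39x",   # followed by a letter: not a token
  "01:39:",   # followed by a token
  "01:39]",   # followed by the escaped "]" token
  "01:39"     # nothing follows, so [^...] has no char to consume
]

# Map each input to its match index (or nil when there is no match).
results = inputs.map { |s| [s, s =~ rx] }.to_h
results.each { |s, idx| puts "#{s.inspect} => #{idx.inspect}" }
```

Note that the last input does not match: a negated character class still has to consume one character. If end-of-string should also count, a negative lookahead such as (?![#{sep_rx}]) would be one option.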

See the Rubular demo

Note that .gsub(/[\]\[\^\\-]/, '\\\\\\&') matches ], [, ^, \ and - and adds a backslash in front of each of them. The first 4 backslashes in '\\\\\\&' define a literal backslash in the replacement pattern, and \\& stands for the whole match.
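If counting backslashes feels fragile, the sketch below inspects the replacement string and compares it with the block form of gsub, which is a swapped-in alternative and not part of the code above. The sample input `":.'-]"` is only illustrative.

```ruby
replacement = '\\\\\\&'

# In a single-quoted literal each \\ is one backslash, so the string
# actually holds three backslashes followed by "&".
replacement_chars = replacement.chars

sample = ":.'-]"

# String replacement: gsub reads \\ as a literal backslash and \& as the match.
escaped_with_string = sample.gsub(/[\]\[\^\\-]/, replacement)

# Block form: the return value is inserted as-is, so no double escaping is needed.
escaped_with_block = sample.gsub(/[\]\[\^\\-]/) { |m| "\\#{m}" }

puts escaped_with_string
puts escaped_with_block
```

Both calls should print the same escaped sequence, with a backslash added before - and ].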

Wiktor Stribiżew