Regex to match pipes not within brackets or braces with nested blocks

Question

I am trying to parse some wiki markup. For example, the following:

{{Some infobox royalty|testing
| name = Louis
| title = Prince Napoléon 
| elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>
| a = [[AA|aa]] | b =  {{cite
|title=TITLE
|author=AUTHOR}}
}}

can be the text to start with. I first remove the starting {{ and ending }}, so I can assume those are gone.

I want to do .split(<regex>) on the string to split the string by all | characters that are not within braces or brackets. The regex needs to ignore the | characters in [[AA|aa]], <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>, and {{cite|title=TITLE|author=AUTHOR}}. The expected result is:

[
 'testing',
 'name = Louis', 
 'title = Prince Napoléon', 
 'elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>',
 'a = [[AA|aa]]',
 'b =  {{cite\n|title=TITLE\n|author=AUTHOR}}'
]

There can be line breaks at any point, so I can't just look for \n|. If there is extra white space in it, that is fine. I can easily strip out extra \s* or \n*.

https://regex101.com/r/dEDcAS/2

Please [check this](https://regex101.com/r/BbTlXY/1) and take a look at *Match Information* block at right. — revo, Oct 31 '18 at 18:19
Parsing wiki markup is a problem that's already been well-solved. I'd suggest using existing code that has already been written, tested and debugged before reinventing the wheel. Googling for "ruby wiki markup parse" turned this up: https://github.com/marnen/rookie — Andy Lester, Oct 31 '18 at 18:27
@AndyLester that parser is from 10 years ago, is not maintained and doesn't work. Appreciate the tip, but not reinventing the wheel when the solution you propose doesn't work. — Zack, Oct 31 '18 at 20:23
@AndyLester I never got very far with implementing the Rookie project, but doesn’t the MediaCloth gem work for interpreting MediaWiki markup? — Marnen Laibow-Koser, Jul 22 '19 at 22:34

Cary Swoveland · Accepted Answer · 2018-11-10T19:28:07.160

The following is a pure-Ruby solution. I assume the braces and brackets in the string are balanced.

str =<<BITTER_END
Some infobox royalty|testing
| name = Louis
| title = Prince Napoléon 
| elevation_imperial_note= <ref name="usgs">{{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>
| a = [[AA|aa]] | b =  {{cite
|title=TITLE
|author=AUTHOR}}
BITTER_END

stack = []   # currently open brackets and braces
last = 0     # index just past the previous top-level pipe
str.each_char.with_index.with_object([]) do |(c,i),locs|
  case c
  when ']', '}'
    stack.pop
  when '[', '{'
    stack << c
  when '|'
    locs << i if stack.empty?   # record only pipes outside all brackets/braces
  end
end.map do |i|
  old_last = last
  last = i+1
  str[old_last..i-1].strip if i > 0
end.tap { |a| a << str[last..-1].strip if last < str.size }
  #=> ["Some infobox royalty",
  #    "testing",
  #    "name = Louis", 
  #    "title = Prince Napoléon",
  #    "elevation_imperial_note= <ref name=\"usgs\">
  #      {{cite web|url={{Gnis3|1802764}}|title=USGS}}</ref>",
  #    "a = [[AA|aa]]",
  #    "b =  {{cite\n|title=TITLE\n|author=AUTHOR}}"]

Note that, to improve readability, I've broken the string that is the antepenultimate element of the returned array across two lines¹.

Explanation

To see how the locations of the pipe symbols on which to split are determined, evaluate the heredoc above to define str, then run the following code, which prints a trace of each step. All will be revealed. (The output is long, so focus on changes to the arrays locs and stack.)

stack = []
str.each_char.with_index.with_object([]) do |(c,i),locs|
  puts "c=#{c}, i=#{i}, locs=#{locs}, stack=#{stack}" 
  case c
  when ']', '}'
    puts "  pop #{c} from stack"
    stack.pop
  when '[', '{'
    puts "  push #{c} onto stack"
    stack << c
  when '|'
    puts stack.empty? ? "  record location of #{c}" : "  skip | as stack is non-empty" 
    locs << i if stack.empty?
  end
  puts "  after: locs=#{locs}, stack=#{stack}" 
end
  #=> [20, 29, 44, 71, 167, 183]
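
The same idea can be packaged as a reusable method. The following is only a sketch: the name split_top_level is hypothetical (it does not appear in the answer above), it tracks a depth counter instead of a stack, and it builds the fields directly rather than recording pipe locations. Like the code above, it assumes the braces and brackets are balanced.

```ruby
# Hypothetical helper (name not from the answer above): split on pipes that
# lie outside every [...] and {...} group, in a single pass over the string.
# Assumes brackets and braces are balanced.
def split_top_level(str)
  depth = 0                  # number of currently open brackets/braces
  fields = [String.new]      # the last element is the field being built
  str.each_char do |c|
    case c
    when '[', '{' then depth += 1
    when ']', '}' then depth -= 1
    end
    if c == '|' && depth.zero?
      fields << String.new   # top-level pipe: start a new field
    else
      fields.last << c       # everything else, nested pipes included
    end
  end
  fields.map(&:strip)
end

p split_top_level("a = [[AA|aa]] | b = {{cite\n|title=T}}")
  #=> ["a = [[AA|aa]]", "b = {{cite\n|title=T}}"]
```

Unlike the code above, this sketch returns an empty string for a field that precedes a leading pipe, so callers may want to reject empty fields.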

If desired, one can confirm the braces and brackets are balanced as follows.

def balanced?(str)
  h = { '}'=>'{', ']'=>'[' }
  stack = []
  str.each_char do |c|
    case c
    when '[', '{'
      stack << c
    when ']', '}'
      stack.last == h[c] ? (stack.pop) : (return false)
    end
  end   
  stack.empty?
end

balanced?(str)
  #=> true

balanced?("[[{]}]")
  #=> false

¹ ...and, in the interest of transparency, to have the opportunity to use a certain word.

Marnen Laibow-Koser · Answer 2 · score 0 · answered Oct 09 '19 at 17:26

Regular expressions can’t handle arbitrary nesting (such as the brackets here), and therefore are the wrong tool for this parsing problem. If you can’t find a ready-made MediaWiki markup parser, you’ll want to use an actual parser library (such as Treetop), not regexes.

Be careful, regular expressions as defined in computer science are indeed unable to handle arbitrary nesting, but this isn't the case for what is called regular expressions *in real life* (at least in Ruby, PHP, R, Perl, .net and all that uses similar regex engines). These regex engines have features that perfectly handle nested structures (balancing groups for .net languages, recursion for the others). – Casimir et Hippolyte Oct 14 '19 at 11:01
@Casimir And how do you recurse nesting in a Ruby regex? I know Ruby regexes well, and I can’t think of how to do this *practically*. Even if it’s technically possible (which I’m not sure it is), it would be awkward, and the time would be better spent using a better tool for the job. – Marnen Laibow-Koser Oct 15 '19 at 12:05
A basic example: https://rubular.com/r/UizxlawOWArYrf or the answer to the question: https://rubular.com/r/Z6EqGuRrtafSFF (or more or less the same written in a more readable way: https://regex101.com/r/9fIYYX/1 ) – Casimir et Hippolyte Oct 15 '19 at 12:21
@Casimir Interesting; I think `\g` is one of the few features of Ruby regexes that I’ve never used. I still claim that these regexes are not especially clear or maintainable compared to a parser-based solution, but it’s good to know that these solutions exist. – Marnen Laibow-Koser Oct 15 '19 at 16:40

Casimir et Hippolyte · Answer 3 · 2019-10-15T18:54:01.687

It's often more complicated to split a string with a split method than to scan for the substrings you need.

Skipping pipes enclosed in brackets is relatively easy: define subpatterns that match possibly nested brackets, and consume them in the main pattern. Pipes enclosed in them are then simply ignored.
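
As a reduced illustration of that idea, here is a sketch that handles curly brackets only. The constant name FIELD and the group names braces and field are made up for this example. The braces group is defined with a {0} quantifier, so it matches nothing where it stands and only serves as a target for \g<braces> calls, including the recursive call inside itself.

```ruby
# Illustrative only: FIELD, braces and field are hypothetical names.
FIELD = /
  (?<braces> \{ (?: [^{}] | \g<braces> )* \} ){0}   # definition only, never matched in place
  (?<field> (?: [^|{}] | \g<braces> )+ )            # plain characters or whole {...} groups
/x

# With named groups, scan returns one array per match (braces, field);
# the field capture is the last element.
p "a=1|b={{x|{{y|z}}}}|c=3".scan(FIELD).map(&:last)
  #=> ["a=1", "b={{x|{{y|z}}}}", "c=3"]
```

Because a whole {...} group is consumed in one step, the pipes inside it never get a chance to end a field.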

To be sure not to match pipes outside the main {{...}} block, if any, use a \G-based pattern. \G is an anchor for the position after the last successful match; it ensures that each match is contiguous with the previous one. Since the closing }} is never consumed in the main pattern, you can be sure that the pattern fails once it is reached and that no further matches are possible.
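
A tiny example of the effect of \G with String#scan, using a made-up input string:

```ruby
# With \G, each match must start where the previous one ended, so scanning
# stops at the first character the pattern cannot consume (the "-").
p "aaa-aa".scan(/\Ga/)   #=> ["a", "a", "a"]

# Without \G, the engine skips the "-" and keeps finding matches after it.
p "aaa-aa".scan(/a/)     #=> ["a", "a", "a", "a", "a"]
```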

pat = /
  # subpatterns
  (?<cb>  { [^{}]*+   (?: \g<cb> [^{}]*   )*+  } ){0} # curly brackets
  (?<sb> \[ [^\]\[]*+ (?: \g<sb> [^\]\[]* )*+ \] ){0} # square brackets

  (?<nbpw> [^|{}\]\[\s]+ ){0} # no brackets, pipes or whitespace

  # main pattern
  (?:
      \G (?!\A) \s* # other contiguous matches branch
    |
      {{ [^|{}]*+ # first match branch
      # check if curly brackets are balanced until }} (optional but recommended)
      (?= [^{}]*+ (?: \g<cb> [^{}]* )*+ }} )
  )
  \| \s* 

  (?<result>
      \g<nbpw>?
      (?: \s* (?: \g<cb> | \g<sb> | \s \g<nbpw> ) \g<nbpw>? )*
  )
/x

str.scan(pat).map{|item| item[3]}

Note that the results are already trimmed of whitespace.

If you want to use it to process several {{...}} blocks at a time, add a capture group around the second branch of the pattern to know when the next block begins.
