3

I am using Ruby's StringScanner to normalize some English text.

require 'strscan'

def normalize text
  s = ''
  ss = StringScanner.new text
  while ! ss.eos? do
    s += ' ' if ss.scan(/\s+/)             # multiple whitespace => single space
    s += 'mice' if ss.scan(/\bmouses\b/)   # mouses => mice
    s += '' if ss.scan(/\bthe\b/)          # remove 'the'
    s += "#$1 #$2" if ss.scan(/(\d)(\w+)/) # should split 3blind => 3 blind
  end
  s
end

normalize("3blind the   mouses")  #=> should return "3 blind mice"

Instead, I am just getting " mice".

StringScanner#scan is not capturing the (\d) and (\w+).

Deduplicator
zhon

2 Answers

4

To access a group captured by StringScanner (in Ruby 1.9 and above), use StringScanner#[]:

  s += "#{ss[1]} #{ss[2]}" if ss.scan(/(\d)(\w+)/) # splits 3blind => 3 blind

In Ruby 2.1, you should also be able to access captures by name (see Peter Alfvin's link):

  s += "#{ss[:num]} #{ss[:word]}" if ss.scan(/(?<num>\d)(?<word>\w+)/)
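
Putting this together, here is a minimal sketch of the whole method using StringScanner#[]. The if/elsif chain, the catch-all /\w+/ and getch branches (added so the loop always advances), and the trailing \s* on the 'the' rule (so that removing the word does not leave two spaces behind) are my own assumptions and are not part of the original question.

```ruby
require 'strscan'

# Sketch only: the branch order and the fallback rules are illustrative choices.
def normalize(text)
  out = +''
  ss = StringScanner.new(text)
  until ss.eos?
    if ss.scan(/\s+/)
      out << ' '                       # multiple whitespace => single space
    elsif ss.scan(/\bmouses\b/)
      out << 'mice'                    # mouses => mice
    elsif ss.scan(/\bthe\b\s*/)
      # remove 'the' together with the whitespace that follows it
    elsif ss.scan(/(\d)(\w+)/)
      out << "#{ss[1]} #{ss[2]}"       # 3blind => 3 blind, via StringScanner#[]
    elsif ss.scan(/\w+/)
      out << ss.matched                # any other word is copied unchanged
    else
      out << ss.getch                  # any other character is copied unchanged
    end
  end
  out
end

puts normalize("3blind the   mouses")
```

Trying exactly one rule per iteration and always consuming something in the final branches guarantees the scan pointer moves forward on every pass, so the loop cannot spin on input that none of the rules recognise.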
zhon
2

Note: The first version of this/my answer was completely off base, per the comment thread. Apologies.

Based on experimentation and review of http://ruby-doc.org/stdlib-1.9.2/libdoc/strscan/rdoc/StringScanner.html, it appears that StringScanner does not set the match variables $1, $2, etc., so that last s += ... statement is only appending a blank to s.
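
A small check along these lines reproduces that observation. The method name capture_demo is hypothetical and only serves to run the scan in a fresh method, where no earlier regexp match has set $1.

```ruby
require 'strscan'

# Hypothetical helper: compares the regexp global with the scanner's own accessors.
def capture_demo
  ss = StringScanner.new('3blind')
  ss.scan(/(\d)(\w+)/)
  {
    global:  $1,               # StringScanner#scan leaves the match globals alone
    scanner: [ss[1], ss[2]],   # the scanner keeps its own record of the groups
    matched: ss.matched        # the full text consumed by the last scan
  }
end

p capture_demo
```

Because nil interpolates as an empty string, "#$1 #$2" collapses to a single space, which is consistent with the output reported in the question.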

Looking at strscan.c, it appears that there is indeed no support for setting those match variables; captured groups are instead available through StringScanner#[]. I also found https://www.ruby-forum.com/topic/4413436, which appears to be an in-progress effort of some sort to extend capture support.

Peter Alfvin
  • Changing when I call ``normalize`` to avoid confusion. – zhon Nov 14 '13 at 23:23
  • Actually, the [scan pointer](http://www.ruby-doc.org/stdlib-2.0.0/libdoc/strscan/rdoc/StringScanner.html) will not move until ``scan`` returns something other than ``nil`` (I am not using any other method to advance the scan pointer). Therefore, the first thing it can ``scan`` is "3blind". I have left out ``next`` to simplify the question. – zhon Nov 14 '13 at 23:34