
I'm currently working with a Markov chain text generator application in Ruby that takes in a body ("corpus") of text and then generates new text based off of that. The problem I need to solve currently is writing a Regexp that will return arrays containing the number of words that I specify. All I want to do here is grab a certain number of words (specified by the user), but multiple times throughout the whole string.

Going off another application I've seen, I'm using something like /(([.,?"();\-!':—^\w]+ ){#{depth}})/ where #{depth} interpolates how many words I want at a time. With depth set to 2, this is supposed to grab two words at a time while allowing a subset of special characters, and that's the piece that's tripping me up. So the full question is this: how can I dynamically specify the number of words (separated by whitespace) I want while also allowing a range of special characters within those words?

Here's what I have currently:

# Regex
@match_regex = /(([.,?"();\-!':—^\w]+ ){2})/
s = input.scan(@match_regex).to_a
puts s.inspect

# Input
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth. 

# Output - seems to be grabbing last word again for some reason
[["Within weeks ", "weeks "], ["they planned ", "planned "], ["a meeting. ", "meeting. "],
["She sent ", "sent "], ["him poetry ", "poetry "], ["along with ", "with "],
["her itinerary, ", "itinerary, "], ["having worked ", "worked "], ["in a ", "a "],
["business meeting ", "meeting "], ["to excuse ", "excuse "],
["the opportunity. ", "opportunity. "], ["He prepared ", "prepared "], ["flowers and ", "and "],
["a banner ", "banner "], ["of welcome ", "welcome "], ["on his ", "his "]]

# Desired output. I'm not picky if it has trailing spaces or not as I can always trim that
["Within weeks", "they planned", "a meeting.", "She sent", "him poetry", "along with",
"her itinerary,", "having worked", "in a", "business meeting", "to excuse", "the opportunity.",
"He prepared", "flowers and", "a banner", "of welcome", "on his"]

Any help would be greatly appreciated. Thanks!

Way Spurr-Chen

2 Answers


In a regex, every set of parentheses creates a capture group unless it is marked as non-capturing. When the pattern contains capture groups, scan returns, for each match found in your input, an array of those groups rather than the matched string.

You have two sets of parentheses: the first around the whole expression and the second around each word. Note that for a repeated capture group, e.g. (foo){x}, only the last repetition is returned. Hence the two-item array for each match, with the last word appearing again as the second item.

To get what you want, you need to remove these capture groups. The first set can simply be dropped. The second set is still needed for the repetition, so make it a non-capturing group by starting it with ?:. The expression you want is therefore:

@match_regex = /(?:[.,?"();\-!':—^\w]+ ){2}/
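To see the difference side by side, here is a minimal sketch. The sample string and the variable names (sample, with_groups, chunks) are illustrative assumptions rather than part of the question, and the character class is a slightly simplified version of the original (the long dash is omitted).

```ruby
# Illustrative input: a short excerpt with a trailing space so the last word matches.
sample = "Within weeks they planned a meeting. "

# With capturing groups, String#scan returns one array of groups per match.
# The inner group is repeated, so only its last repetition is kept.
capturing = /(([.,?"();\-!':^\w]+ ){2})/
with_groups = sample.scan(capturing)

# With a non-capturing group, scan returns the matched strings themselves.
depth = 2 # number of words per chunk, interpolated into the regex
non_capturing = /(?:[.,?"();\-!':^\w]+ ){#{depth}}/
chunks = sample.scan(non_capturing).map(&:strip) # strip drops the trailing space

p with_groups
p chunks
```

Changing depth changes how many words land in each chunk; words left over at the end that do not fill a whole chunk are simply not matched.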

hbwales

If I understand your question correctly, I think this should work for you:

def split_it(text, num_words, special_chars)
  text.scan(/(?:[\w#{special_chars}]+(?:\s+|$)){#{num_words}}/)
end

text =<<_
Within weeks they planned a meeting. She sent him poetry along with her itinerary,
having worked in a business meeting to excuse the opportunity. He prepared flowers
and a banner of welcome on his hearth.
_

special_chars = ".,?\"();\\-!':"

split_it(text, 2, special_chars)
  #=> ["Within weeks ", "they planned ", "a meeting. ", "She sent ", "him poetry ",
  #    "along with ", "her itinerary,\n", "having worked ", "in a ",
  #    "business meeting ", "to excuse ", "the opportunity. ", "He prepared ",
  #    "flowers\nand ", "a banner ", "of welcome ", "on his "]
split_it(text, 3, special_chars)
  #=> ["Within weeks they ", "planned a meeting. ", "She sent him ",
  #    "poetry along with ", "her itinerary,\nhaving ", "worked in a ",
  #    "business meeting to ", "excuse the opportunity. ", "He prepared flowers\n",
  #    "and a banner ", "of welcome on "]

Note the \\- in special_chars. Written as - or \- in the double-quoted string, it ends up between the brackets of the regex as a bare -, which Ruby reads as a range operator; here the range ;-! is invalid, so Ruby raises an exception. The extra backslash puts \- between the brackets, which tells Ruby it is a literal -. As @Amadan pointed out, no escaping is needed if - is the first or last character inside the brackets.
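The behaviour of the dash can be checked with a short sketch. The variable names (bad_chars, trailing_dash, escaped) are hypothetical, and using Regexp.escape is an alternative not mentioned in the answer: it backslash-escapes regex metacharacters, including -, so the dash can sit anywhere in the list.

```ruby
# An unescaped "-" in the middle of a character class is read as a range.
# Here ";-!" is a backwards range, so building the regex fails.
bad_chars = ".,?\"();-!':"
error = begin
  Regexp.new("[\\w#{bad_chars}]+")
  nil
rescue RegexpError => e
  e
end

# Option 1: put the dash last inside the brackets, where it is a literal.
trailing_dash = ".,?\"();!':-"
regex_a = Regexp.new("[\\w#{trailing_dash}]+")

# Option 2 (an assumption, not from the answer): let Regexp.escape add
# the backslashes, so the order of the characters no longer matters.
escaped = Regexp.escape(".,?\"();-!':")
regex_b = Regexp.new("[\\w#{escaped}]+")

p error.class
p "well-known, isn't it?".scan(regex_a)
p "well-known, isn't it?".scan(regex_b)
```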

Markov chains? Hmmm.

Cary Swoveland
  • Another tactic for handling `-` is making sure it is either the first or the last character in the brackets; this way, it will signify a literal dash and not a range, even without escaping. – Amadan Sep 22 '14 at 00:48