I am trying to fix a regex I use in a chatops bot built on Lita. The regex is:

/^(?:how\s+do\s+I\s+you\s+get\s+far\s+is\s+it\s+from\s+)?(.+)\s+to\s+(.+)/i

This is supposed to capture the words before and after 'to', with optional words in front that can form questions like: How do I get from x to y, how far from x to y, how far is it from x to y.

expected output:

match 1 : "x"
match 2 : "y"

For the most part my optional words work as expected. But when I pull my response matches, the first capture group also includes the question words that lead up to it.

So, how far is it from sfo to lax should return:

sfo and lax.

But instead returns:

how far is it from sfo and lax

joelparkerhenderson
cashman04
  • The regex is doing what you said at the end of your question. What is it you are trying to accomplish? Edit your question to give a clear idea of expected output. Use some blank lines around it to make it readable. – Beartech Mar 21 '15 at 22:02
  • Sorry, the problem is with the first captured group. In my last example, I should only get `sfo` returned, but instead I get `how far is it from sfo`. I'll edit my question to clarify it better. – cashman04 Mar 21 '15 at 22:05

2 Answers

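The problem is the first part of your regex. It is one fixed sequence, so it can only match the literal phrase "how do I you get far is it from". Because that group is optional, it is simply skipped for any real question, and the first (.+) then captures everything before "to".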
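To see the behaviour concretely, here is a minimal sketch that runs the pattern from the question against the example sentence. The constant name `ORIGINAL` and the variable names are only for illustration.

```ruby
# The pattern from the question, copied unchanged.
ORIGINAL = /^(?:how\s+do\s+I\s+you\s+get\s+far\s+is\s+it\s+from\s+)?(.+)\s+to\s+(.+)/i

m = ORIGINAL.match("how far is it from sfo to lax")
origin, destination = m.captures

# The optional prefix cannot match "how far ...", so it is skipped
# and the greedy (.+) swallows the question words as well.
puts origin       # prints "how far is it from sfo"
puts destination  # prints "lax"
```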

To choose from multiple options, use this syntax:

(a|b|c)

What I think you're trying to do is this:

/^(?:(?:how|do|I|you|get|far|is|it|from)\s+)*(.+)\s+to\s+(.+)/i

This regexp skips any run of the listed words at the start of the string, in any order and any number of times, before the first capture group begins.
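A quick way to check this is to run the pattern against a few of the phrasings from the question. This is only a sketch: `FIXED` and the helper `endpoints` are hypothetical names, not part of Lita.

```ruby
FIXED = /^(?:(?:how|do|I|you|get|far|is|it|from)\s+)*(.+)\s+to\s+(.+)/i

# Hypothetical helper: returns [origin, destination], or nil when nothing matches.
def endpoints(text)
  m = FIXED.match(text)
  m && m.captures
end

p endpoints("how far is it from sfo to lax")          # => ["sfo", "lax"]
p endpoints("how far from sfo to lax")                # => ["sfo", "lax"]
p endpoints("How do I get from my town to your town") # => ["my town", "your town"]
p endpoints("sfo")                                    # => nil
```

Multi-word places still work because the capture groups use (.+), and only the leading filler words are consumed by the non-capturing group.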

If you want to preserve word order, you can use regexps such as this pseudocode:

… how (can|do|will) (I|you|we) (get|go|travel) from …
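One possible way to turn that idea into a working pattern is sketched below. The exact phrasings it accepts are an assumption on my part: the "how far" branch is added to cover the question's examples, and the constant name `ORDERED` is hypothetical.

```ruby
# Extended mode (/x) ignores literal whitespace in the pattern, so \s+ is spelled out.
ORDERED = /
  \A
  (?:
    (?:
      how\s+(?:can|do|will)\s+(?:I|you|we)\s+(?:get|go|travel)  # e.g. "how do I get"
      |
      how\s+far(?:\s+is\s+it)?                                   # "how far" or "how far is it"
    )
    \s+from\s+
  )?
  (.+)\s+to\s+(.+)
  \z
/ix

p ORDERED.match("how do I get from x to y").captures       # => ["x", "y"]
p ORDERED.match("how far is it from sfo to lax").captures  # => ["sfo", "lax"]
p ORDERED.match("sfo to lax").captures                     # => ["sfo", "lax"]
```

The trade-off is strictness: a question that does not fit one of the listed phrasings falls through to the bare pattern, so its leading words end up in the first capture again.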
joelparkerhenderson
  • This works. The way I had it was needlessly trying to have it match multiple sentence possibilities. Just matching one of the words that would be in the common ways someone would ask for directions seems adequate. Thanks for the help, and simplification of what I was doing. – cashman04 Mar 21 '15 at 22:14
  • @joelparkerhenderson, does the fact that the first two sets of parens (the nested ones) have a modifier after them, `\s+` on the inner and `*` on the outer, determine why that does not get treated as a match group and just as a combination of terms? – Beartech Mar 21 '15 at 22:18
  • @Beartech The `\s` means whitespace. The `\s+` means one or more whitespace characters, a.k.a. anything that we expect to be between words. The `*` means zero or more. The inner paren group means "skip one word followed by whitespace". The outer paren group means "do this repeatedly" i.e. "skip all these words". – joelparkerhenderson Mar 21 '15 at 22:25
  • I understand the regex, but just realized I was ignoring the meaning of the non-capture group `?:` – Beartech Mar 21 '15 at 22:31
  • cashman04, this site is your best friend for regex: http://rubular.com, though it strangely does not mention the non-capture syntax. – Beartech Mar 21 '15 at 22:34
  • It's worth noting that it's often easiest to build regexes such as this one with code. If `pre_words = ["how", "do", "I", "you", "get", "far", "is", "it", "from"]`, one could obtain Joel's regex as follows: `/^(?:(?:#{pre_words.join('|')})\s+)*(.+)\s+to\s+(.+)/i` or `/^(?:(?:#{Regexp.union(pre_words).source})\s+)*(.+)\s+to\s+(.+)/i`. This also has the advantage that if `pre_words` changes, the regex is automatically updated. – Cary Swoveland Mar 22 '15 at 04:24
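
A minimal sketch of building the pattern from a word list follows. The names `PRE_WORDS`, `BUILT` and `EMBEDDED` are hypothetical. One detail to be aware of: interpolating a Regexp object itself (rather than its `source` string) embeds it together with its own flags, so the outer `/i` no longer applies to the word list.

```ruby
PRE_WORDS = %w[how do I you get far is it from].freeze

# Regexp.union escapes each word; .source gives the bare "how|do|I|..." string.
alternation = Regexp.union(PRE_WORDS).source
BUILT = /^(?:(?:#{alternation})\s+)*(.+)\s+to\s+(.+)/i

p BUILT.match("How far is it from sfo to lax").captures
# => ["sfo", "lax"]

# Interpolating the Regexp directly embeds "(?-mix:how|do|...)",
# which switches case-insensitivity off for the word list.
EMBEDDED = /^(?:(?:#{Regexp.union(PRE_WORDS)})\s+)*(.+)\s+to\s+(.+)/i

p EMBEDDED.match("How far is it from sfo to lax").captures
# => ["How far is it from sfo", "lax"]
```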

When you want to match words, \w is the most natural pattern I'd use (e.g., it is used in word count tools.)

To capture exactly one word before and one word after "to" as a single match, you can use the regex (\w+\s+to\s+\w+).

To return them as 2 different groups, you can use (\w+)\s+to\s+(\w+).
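
As a rough illustration of what the two-group pattern returns, here is a small sketch. The constant name `WORD_PAIR` is hypothetical, and the pattern is left unanchored and case-sensitive exactly as written above.

```ruby
WORD_PAIR = /(\w+)\s+to\s+(\w+)/

# Single-word places: the question words are ignored without listing them.
p WORD_PAIR.match("how far is it from sfo to lax").captures
# => ["sfo", "lax"]

# Multi-word places: only the words directly next to "to" are captured.
p WORD_PAIR.match("how far is it from my town to your town").captures
# => ["town", "your"]
```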

Have a look at the demo.

Wiktor Stribiżew
  • I think the OP wanted to be able to catch instances where there might be more than one word on either side of "to" that is not one of the basic "question words", e.g. "how far is it from my town to your town". joelparkerhenderson's answer covers that case. But I give you an upvote for the use of \w. – Beartech Mar 21 '15 at 22:23