I have a regex problem in R. I have three sentences:

s1 <- "today john jack and joe go to the beach"
s2 <- "today joe and john go to the beach"
s3 <- "today jack and joe go to the beach"

For each sentence, I want to know whether john is going to the beach today, regardless of the other two guys. The expected results for the three sentences are, in order:

TRUE
TRUE
FALSE 

I tried to do this with grepl in R. The following regex returns TRUE for all three sentences:

print(grepl("today (john|jack|joe|and| )+go to the beach", s1))
print(grepl("today (john|jack|joe|and| )+go to the beach", s2))
print(grepl("today (john|jack|joe|and| )+go to the beach", s3))

It helps when I sandwich "john", the compulsory word, between two identical quantifiers for the other, optional words:

print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s1))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s2))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s3))

However, this is obviously bad coding because of the repetition. Does anyone have a more elegant solution?

2 Answers

You may use .* in places where you do not know what may appear there:

s <- c("today john jack and joe go to the beach", "today joe and john go to the beach", "today jack and joe go to the beach")
grepl("today .*\\bjohn\\b.* go to the beach", s)
## => [1]  TRUE  TRUE FALSE

See online R demo

The \b word boundaries are used to match john as a whole word.
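To illustrate what the boundaries buy you, here is a small sketch with a made-up second sentence containing "johnny" (that sentence is my assumption, not from the question). Without `\b`, the pattern also matches when "john" is only part of a longer word:

```r
# Hypothetical test strings; the second one contains "johnny", not "john".
x <- c("today john and joe go to the beach",
       "today johnny and joe go to the beach")

# Without boundaries, "john" also matches inside "johnny".
loose <- grepl("today .*john.* go to the beach", x)
loose
## => [1] TRUE TRUE

# With \\b on both sides, only the whole word "john" matches.
strict <- grepl("today .*\\bjohn\\b.* go to the beach", x)
strict
## => [1]  TRUE FALSE
```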

EDIT: If you have a predefined whitelist of words that may appear between today and go, you cannot just match anything. You need an alternation group that lists all the allowed alternatives, and, if you really want to shorten the pattern, you can use a subroutine call within a PCRE regex:

> grepl("today ((?:jack|joe|and| )*)john(?1)\\bgo to the beach", s, perl=TRUE)
[1]  TRUE  TRUE FALSE

See the regex demo.

Here, the alternatives are wrapped within a non-capturing group that is quantified, and the whole group is wrapped with a "technical" capturing group that can be recursed with the (?1) subroutine call (1 means capturing group #1).
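If PCRE is not an option, another way to avoid typing the alternation twice might be to build the pattern from a string fragment. This is only a minimal sketch and not part of the approach above; the variable names `optional` and `pattern` are made up for illustration:

```r
s <- c("today john jack and joe go to the beach",
       "today joe and john go to the beach",
       "today jack and joe go to the beach")

# Write the optional words once, then reuse the fragment on both sides of "john".
optional <- "(jack|joe|and| )*"   # hypothetical helper name
pattern  <- sprintf("today %sjohn%sgo to the beach", optional, optional)

res <- grepl(pattern, s)
res
## => [1]  TRUE  TRUE FALSE

# Words outside the whitelist are rejected.
other <- grepl(pattern, "today john does not go to the beach")
other
## => [1] FALSE
```

The resulting regex is the same as the "sandwich" pattern from the question; only the R source is shorter.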

Wiktor Stribiżew
  • I thought of the .*, but I do not want that anything can come there, e.g. "today john does not go to the beach" would match your regex. – user3766450 Dec 08 '16 at 11:54
  • So, what is the actual requirement then? Please define what *can* appear between `today` and `go`. Once you come up with concrete specs, the regex will be updated in no time. – Wiktor Stribiżew Dec 08 '16 at 11:54
  • Maybe it is hard to explain. I want to edit the expression "(john|jack|joe|and| )*" such that john appears at least once in this part of the regex. The words jack, joe, and and the space are optional, but there cannot be other words. As said, "(jack|joe|and| )*john(jack|joe|and| )*" does the trick but is long and has a repetition. – user3766450 Dec 08 '16 at 13:40
  • I updated the answer, please see if it fits your understanding of "elegant". – Wiktor Stribiżew Dec 08 '16 at 14:32
  • Pretty nice, Wiktor, and definitely elegant :) – user3766450 Dec 09 '16 at 09:38

Do you need to validate the rest of the sentence? Because otherwise I’d go for simple:

sentences = c(s1, s2, s3)
grepl('\\bjohn\\b', sentences)
# [1]  TRUE  TRUE FALSE

This performs less validation but it expresses the intent of the statement much more obviously: “does John appear in the sentence?”
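To make "less validation" concrete, here is a small sketch. The fourth sentence is an extra test case added for illustration; it mentions john without him going to the beach, and the simple check still reports TRUE:

```r
s1 <- "today john jack and joe go to the beach"
s2 <- "today joe and john go to the beach"
s3 <- "today jack and joe go to the beach"

# Extra test case (an assumption for illustration): john is mentioned,
# but he is not among the people going.
s4 <- "today jack and joe go to the beach but not john"

res <- grepl("\\bjohn\\b", c(s1, s2, s3, s4))
res
## => [1]  TRUE  TRUE FALSE  TRUE
```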

Konrad Rudolph
  • Yes, the rest of the sentence has to be matched and could contain "john" again, e.g. "today jack and joe go to the beach but not john" should give FALSE. – user3766450 Dec 08 '16 at 11:56
  • @user3766450 Frankly, at that point you should stop using regex (which are simply the wrong tool for the job) and start using [natural language processing](https://cran.r-project.org/web/views/NaturalLanguageProcessing.html). There’s no way that regex will give you an acceptable precision/recall ratio, since they do not actually parse English, they only perform character-by-character matching. – Konrad Rudolph Dec 08 '16 at 12:00