2

I'm attempting to write a regular expression in Java that has a conditional search term.

What I mean is this: say I have five words: tree, car, dog, cat, bird. I would like the expression to search for these terms, but it is only required to match three of the five, and they may be any three of the five.

I thought perhaps using a back reference such as ?(3) would work, but it doesn't seem to do the trick.

A standard optional match (?) wouldn't work either, because every term is optional but the number of required matches is not. Essentially, is there a way to create a pattern that must be 50% (or any percent) correct to produce a match?

Would anyone happen to know or could point me in the right direction?

(I would hopefully like it working client side if possible)

Andrew Thompson
confused
  • It has to match 3 *unique* ones out of those 5? – aioobe Aug 21 '11 at 20:43
  • Yes if at all possible. It would search all incoming emails for the words I outlined. If a certain amount of those words are matched (say 50%) in any one email, my rules engine would delete the email. If the word tree is present three times in one email, it would only return one match. If tree is present three times, and cat twice, this would return a value of two word matches and still result in a no match. – confused Aug 21 '11 at 21:27

3 Answers

2

Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. Pseudocode assuming you want N unique matches (you can also swap the substring test with a regex, doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):

bool has_N_words(int n, string[] words, string text) {
    int matches = 0;
    foreach word in words {
        // each word counts at most once, however often it occurs in text
        if (word.substringOf(text)) matches++
        if (matches >= n) return true
    }
    return false
}
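For Java, the loop above could be sketched roughly as follows. The class name `WordThreshold` and method name `hasNWords` are made up for illustration, and the word-boundary test (`\b`) is an assumption about what should count as a match; a plain `text.contains(word)` would also match "cat" inside "category".

```java
import java.util.List;
import java.util.regex.Pattern;

public class WordThreshold {
    // Returns true once at least n distinct words from the list occur in the text.
    // Assumes the list itself contains no duplicate words.
    static boolean hasNWords(int n, List<String> words, String text) {
        int matches = 0;
        for (String word : words) {
            // \b keeps "cat" from matching inside "category";
            // Pattern.quote escapes any regex metacharacters in the word.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                        Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                matches++;      // each word counts once, however often it occurs
            }
            if (matches >= n) {
                return true;    // stop early once the threshold is reached
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = List.of("tree", "car", "dog", "cat", "bird");
        System.out.println(hasNWords(3, words, "A tree, a tree and a cat."));         // false
        System.out.println(hasNWords(3, words, "The dog chased the cat up a tree.")); // true
    }
}
```

Because each word is tested once, repeated occurrences of "tree" never raise the count past one, which is the behaviour asked for in the question's comments.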

It seems to me the only (save mind-blowing uses of obscure regex extensions - not that I have something in mind, I've just been surprised again and again what modern regex implementations allow) way to do this with a regular expression goes like this:

  1. Enumerate all unique selections of words (ordered permutations or unordered combinations, depending on the implementation; see below)
  2. For each selection, build a sub-regex that matches a string containing those words, either by
    1. joining its three words with .*? (this fixes the order, so it requires every ordered permutation)
    2. using three lookahead assertions like (?=.*word) (these ignore order, so a combination that already occurred in a different order can be dropped)
  3. Combine all sub-regexes in one giant alternation (or).
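A minimal Java sketch of the lookahead variant of these steps, under the assumption that substring matches are acceptable. `CombinationRegex`, `combine` and `buildRegex` are hypothetical names; the generated pattern is anchored with `^` so the lookaheads are only evaluated from the start of the text, and `(?s)` lets `.` cross line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CombinationRegex {
    // Collects every k-element combination of words (order ignored).
    static void combine(List<String> words, int k, int start,
                        List<String> current, List<List<String>> out) {
        if (current.size() == k) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = start; i < words.size(); i++) {
            current.add(words.get(i));
            combine(words, k, i + 1, current, out);
            current.remove(current.size() - 1);   // backtrack
        }
    }

    // Builds one alternation with a group of k lookaheads per combination.
    static String buildRegex(List<String> words, int k) {
        List<List<String>> combos = new ArrayList<>();
        combine(words, k, 0, new ArrayList<>(), combos);
        StringBuilder sb = new StringBuilder("(?s)^(?:");
        for (int c = 0; c < combos.size(); c++) {
            if (c > 0) sb.append('|');
            for (String w : combos.get(c)) {
                sb.append("(?=.*").append(Pattern.quote(w)).append(')');
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String regex = buildRegex(List.of("tree", "car", "dog", "cat", "bird"), 3);
        Pattern p = Pattern.compile(regex);
        System.out.println(p.matcher("The dog chased the cat up a tree.").find()); // true
        System.out.println(p.matcher("tree tree tree and a cat").find());          // false
    }
}
```

For 3 out of 5 words this yields 10 alternatives; the count grows combinatorially with the word list, which is the cost the next paragraph refers to.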

That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.

  • Thank you very much for the response! I believe it must be a free standing regular expression. While the engine I am using is run on Regex and is server side, it seems the code required is client side only. I've attempted to use something along the lines of [cat?|dog?|tree?]{1,3} with no success. – confused Aug 21 '11 at 21:19
  • @confused: Well, if it has to be a lone regex, I fear it will have to be as I described in the second part of the answer, i.e. enumerating all possible combinations. –  Aug 22 '11 at 12:05
0

I don't see why you would want to do this with a regex, but if you really need it to be a regex:

/(tree|car|dog|cat|bird)/

Then count the matches you get from that...
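If only distinct words should count, one possible way to do the counting in Java is to collect the matches in a set. This is a sketch, not part of the original suggestion: the class `DistinctMatches`, the method `countDistinct` and the case-insensitive flag are assumptions, and the pattern has no word boundaries, so "car" would also be found inside "cart".

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DistinctMatches {
    static final Pattern WORDS =
            Pattern.compile("tree|car|dog|cat|bird", Pattern.CASE_INSENSITIVE);

    // Counts how many different words from the alternation occur in the text.
    static int countDistinct(String text) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            // Normalise case so "Tree" and "tree" count as the same word
            seen.add(m.group().toLowerCase(Locale.ROOT));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct("tree tree tree cat"));  // 2
        System.out.println(countDistinct("Tree, DOG and bird"));  // 3
    }
}
```

The caller would then compare the returned count against the required threshold, for example `countDistinct(email) >= 3`.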

sg3s
  • Note that this only works if multiple matches per word should count. –  Aug 21 '11 at 20:59
  • Regex is being used to filter out unwanted emails in a system. So it would automatically disallow any emails that have 3 out of 5 (for argument's sake) terms mentioned. Counting the matches would not work. /(tree|car|dog|cat|bird)/ It would search for those terms and the code I have in place would automatically delete the emails. However, only if a minimum of 3 out of the 5 terms are matched. It could be three, four or all words, but not two or one. – confused Aug 21 '11 at 21:04
0
(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*

The (?i) makes the match case insensitive, and the (?s) lets .* match across line breaks as well, since you are looking at emails.
The ? after {3,} makes that quantifier reluctant.

I haven't actually tried it.
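A quick way to try the pattern in Java is sketched below; the class `ReluctantCheck` and the sample strings are made up for illustration. It uses `Matcher.matches()`, so the whole text has to be consumed by the pattern.

```java
import java.util.regex.Pattern;

public class ReluctantCheck {
    static final Pattern P =
            Pattern.compile("(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*");

    // True when the text contains at least three occurrences of the listed words.
    static boolean matches(String text) {
        return P.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("The dog chased the cat up a tree.")); // true
        System.out.println(matches("tree tree tree"));                    // true
        System.out.println(matches("tree and cat"));                      // false
    }
}
```

The second case shows that the pattern counts occurrences rather than distinct words, so three mentions of "tree" are enough for a match.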

toto2