2

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.

          11111
012345678901234
ab ab bc  a cba

In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.

Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?

I am hoping to implement this in Java.

Gene Golovchinsky

4 Answers

2

Here's a regex that matches 5-character sequences that include all of 'a', 'b' and 'c':

(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}

So, while the pattern basically matches any 5 characters (with .{5}), each match has to satisfy three preconditions. Each precondition requires one of the tokens to be present within the window (up to 4 characters followed by 'a', and so on). (?=X) is a zero-width positive look-ahead: it asserts that X matches at the current position, and zero-width means that the character position is not moved while matching.
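The same pattern can be generated for other token sets and window sizes. The following is only a minimal sketch, assuming single-character tokens; the class name WindowRegex and the methods build and findAll are hypothetical names introduced for illustration. Note that find() returns non-overlapping matches, and that . does not match line terminators by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WindowRegex {
    // Hypothetical helper: builds (?=.{0,w-1}t1)(?=.{0,w-1}t2)....{w}
    // for single-character tokens and a window of w characters.
    static String build(String tokens, int windowLen) {
        StringBuilder sb = new StringBuilder();
        for (char c : tokens.toCharArray()) {
            sb.append("(?=.{0,").append(windowLen - 1).append('}')
              .append(Pattern.quote(String.valueOf(c)))  // quote in case the token is a metacharacter
              .append(')');
        }
        return sb.append(".{").append(windowLen).append('}').toString();
    }

    // Collects the non-overlapping matches that Matcher.find() reports.
    static List<String> findAll(String haystack, String tokens, int windowLen) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(build(tokens, windowLen)).matcher(haystack);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(build("abc", 5));
        System.out.println(findAll("ab ab bc  a cba", "abc", 5));
    }
}
```

On the question's sample string this reports "ab bc" and " a cb", because the second match starts at position 9 and consumes the characters that "a cba" would need.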

Doing this with regexes is slow, though. Here's a more direct version (it seems to be about 15x faster than using regular expressions):

public static void find(String haystack, String tokens, int windowLen) {
    char[] tokenChars = tokens.toCharArray();
    int hayLen = haystack.length();

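    int pos = 0; // start of the candidate window
    nextPos:
    while (pos + windowLen <= hayLen) {
        for (char c : tokenChars) {
            // first occurrence of this token at or after the window start
            int i = haystack.indexOf(c, pos);
            if (i < 0) return; // token never occurs again, so no further match is possible

            if (i - pos >= windowLen) {
                // token lies beyond the window: jump to the first window that can contain it
                pos = i - windowLen + 1;
                continue nextPos;
            }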
        }

        // match found at pos
        System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
        pos++;
    }
}
xs0
  • Well, the question said "other tokens", so I think you're just jumping to conclusions when requiring other characters to be whitespace. – xs0 Apr 30 '11 at 01:46
  • Gosh, yeah, +1 :) I did jump to conclusions. – manojlds Apr 30 '11 at 01:49
  • thanks for the answer! I wound up implementing something along these lines, but my solution is greedy. In principle, a solution that prefers tighter groupings of tokens within a window is preferable to one that takes the first match. – Gene Golovchinsky May 02 '11 at 00:10
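One way to prefer tighter groupings, as the last comment suggests, is to track the most recent position of each token and, at every position, measure the span back to the oldest of those positions. This is only a minimal sketch of that idea, assuming single-character tokens; TightestWindow and find are hypothetical names, and it returns just the single shortest span that fits in the window rather than every candidate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class TightestWindow {
    // Returns {start, end} (inclusive) of the shortest span that contains every
    // token character and is at most windowLen long, or null if there is none.
    static int[] find(String haystack, String tokens, int windowLen) {
        int distinct = (int) tokens.chars().distinct().count();
        Map<Character, Integer> lastSeen = new HashMap<>();
        int[] best = null;
        for (int i = 0; i < haystack.length(); i++) {
            char c = haystack.charAt(i);
            if (tokens.indexOf(c) < 0) continue;      // not one of our tokens
            lastSeen.put(c, i);
            if (lastSeen.size() < distinct) continue; // some token not seen yet
            // The shortest span ending at i starts at the oldest "last seen" position.
            int start = Collections.min(lastSeen.values());
            int len = i - start + 1;
            if (len <= windowLen && (best == null || len < best[1] - best[0] + 1)) {
                best = new int[] {start, i};
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] span = find("ab ab bc  a cba", "abc", 5);
        System.out.println(span == null ? "no match" : span[0] + ".." + span[1]);
    }
}
```

For the question's sample string and a window of 5, this picks the span 12..14 ("cba") over the looser span 3..7 ("ab bc").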
2

This tested Java program has a commented regex which does the trick:

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "ab ab bc  a cba";
        Pattern p = Pattern.compile(
            "# Match 5 char sequences containing: a and b and c\n" +
            "(?=[abc])     # Assert first char is a, b or c.\n" +
            "(?=.{0,4}a)   # Assert an 'a' within 5 chars.\n" +
            "(?=.{0,4}b)   # Assert a 'b' within 5 chars.\n" +
            "(?=.{0,4}c)   # Assert a 'c' within 5 chars.\n" +
            ".{5}          # If so, match the 5 chars.",
            Pattern.COMMENTS);
        Matcher m = p.matcher(s);
        while (m.find()) {
            System.out.print("Match = \""+ m.group() +"\"\n");
        } 
   }
}

Note that there is another valid sequence, S9:13 " a cb", in your test data (before the S12:14 "cba"). Assuming you did not want to match this one, I added a constraint to filter it out: the 5-char window must begin with an a, b or c.

Here is the output from the script:

Match = "ab bc"
Match = "a cba"
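Because find() resumes after the end of the previous match, overlapping windows are never reported. If every qualifying window is wanted, one option is to wrap the whole pattern in an outer look-ahead with a capturing group, so that each match consumes nothing. This is a sketch only: it drops the (?=[abc]) constraint on purpose, and OverlappingWindows and findStarts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingWindows {
    // Returns the start index of every 5-char window that contains a, b and c.
    static List<Integer> findStarts(String s) {
        // The outer (?=(...)) is zero-width, so after each match the next
        // search begins one position later and overlapping windows are seen.
        Pattern p = Pattern.compile("(?=((?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}))");
        Matcher m = p.matcher(s);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start()); // m.group(1) holds the window text
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findStarts("ab ab bc  a cba"));
    }
}
```

On the test string this finds windows starting at 3, 6, 9 and 10, that is "ab bc", "bc  a", " a cb" and "a cba".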

ridgerunner
1

Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:

abc..|ab.c.|ab..c| .... etc.

This can be factorised somewhat:

ab(c..|.c.|..c)|a.(bc.|b.c .... etc.

I'm not sure if you can do better with regex.

Oliver Charlesworth
  • You can do a little better (but not much) with ranges. For example a.{0,2}b.{0,2}c matches multiple valid options at once within window 7, so you end up with fewer expressions - but you still have to consider permutations, of course. – Eduardo Ivanec Apr 30 '11 at 00:18
  • @Eduardo: Wouldn't that also match `aXXbXXc`? – Oliver Charlesworth Apr 30 '11 at 00:22
  • Mmm sure... shouldn't it? Note I used window 7, I thought the gain would be more obvious with a larger window. – Eduardo Ivanec Apr 30 '11 at 00:28
  • Yeah, it's the permutations that are making this impractical. The other issue is given a window `w`, how to specify `a.{0,y}b.{0,w-y}c` such that the two sub-ranges add up to `w`. – Gene Golovchinsky Apr 30 '11 at 00:29
  • @Eduardo: I don't think so. The OP wanted (in this particular example), the `a`,`b`,`c` to be contained within a 5 character window. `aXXbXXc` is 7. – Oliver Charlesworth Apr 30 '11 at 00:29
0
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc  a cba";
Matcher m = p.matcher(s);
while (m.find())
{
  System.out.println(m.group());
}

output:

ab bc
 a cb

This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.

The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.
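The pattern can also be assembled for other token sets. The following is a rough sketch, assuming at most nine single-character tokens (so the back-references stay single-digit); EmptyGroupWindow, build and findAll are hypothetical names. It relies on the same undocumented behavior described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyGroupWindow {
    // Hypothetical helper: builds (?:t1()|t2()|...|.){w}\1\2...
    // Each empty group records that its token was seen inside the window.
    static String build(String tokens, int windowLen) {
        StringBuilder alternatives = new StringBuilder("(?:");
        StringBuilder backrefs = new StringBuilder();
        for (int i = 0; i < tokens.length(); i++) {
            alternatives.append(Pattern.quote(String.valueOf(tokens.charAt(i)))).append("()|");
            backrefs.append('\\').append(i + 1); // fails unless group i+1 participated
        }
        return alternatives.append(".){").append(windowLen).append('}')
                           .append(backrefs).toString();
    }

    static List<String> findAll(String haystack, String tokens, int windowLen) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(build(tokens, windowLen)).matcher(haystack);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ab ab bc  a cba", "abc", 5));
    }
}
```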

Alan Moore