5

I am trying to take logical match criteria like:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ

and apply them as a match against a file in Pig using

result = filter inputfields by text matches '(some regex here)';

The problem is that I have no idea how to turn the logical expression above into a regex for the matches method.

I have fiddled around with various things, and the closest I have come is something like this:

((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)

Any ideas? I also need to do this conversion programmatically if possible.

Some examples:

a - The quick brown Foo jumped over the lazy test (This should pass as it contains Foo and test)

b - the was something going on in TestZ (This should also pass as it contains TestZ)

c - the quick brown Foo jumped over the lazy dog (This should fail as it contains Foo but none of test, testA or TestB)

Thanks

user7337271
user2495234
  • For the eagle-eyed, there's a missing ")" before "OR TestZ". Please ignore this typo. Thanks – user2495234 Sep 01 '13 at 11:24
  • If this typo is not intentional, you can correct it using the edit option below the question instead of informing others about it :) – Pshemo Sep 01 '13 at 11:26
  • I have a few ideas for how to write your regex, but its form would depend on what input you have and what result you expect. For now I am not sure if `test` is mandatory after the `foo bar` part. If so, should it also be included in the match (you are using look-ahead (?=...), so probably not)? Also, you are saying that there should be a `)` before `OR TestZ`, so is it right that `TestZ` alone is enough for a match? – Pshemo Sep 01 '13 at 11:33
  • Hi, as you rightly pointed out I can edit it, so I've added the bracket now. We effectively have a list of sentences in the inputfields file (in the field text). I'm looking for just the text that matches the criteria – user2495234 Sep 01 '13 at 11:37
  • What does the AND operator in your example mean? A text cannot match "Foo" and "test" both at the same time. Or is it supposed to match "Foo test"? Can you post a couple of examples of your input data and which ones you want to match? – jkovacs Sep 01 '13 at 11:43
  • It needs to contain the words. I'll put an example in – user2495234 Sep 01 '13 at 11:44
  • This is a related post, but it doesn't address the logic element of what I need to do: http://stackoverflow.com/questions/7445832/pig-filter-a-string-on-the-basis-of-a-word – user2495234 Sep 01 '13 at 11:49

2 Answers

13

Since you're using Pig, you don't actually need an involved regular expression. You can combine the boolean operators supplied by Pig with a couple of simple regular expressions, for example:

T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
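
Since the question also asks about doing the conversion programmatically, here is a minimal Java sketch of how a filter condition of this shape could be generated from a small expression tree. The class `PigFilterBuilder`, the `Node` interface, and the helpers `anyOf`, `and`, and `or` are hypothetical names made up for this illustration, not part of Pig. The sketch assumes the words contain no regex metacharacters and no single quotes.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class PigFilterBuilder {
    // Hypothetical expression tree node: renders itself as a Pig boolean condition.
    interface Node {
        String toPig(String field);
    }

    // Leaf: true when any one of the words occurs anywhere in the field.
    // Assumption: the words contain no regex metacharacters or single quotes.
    static Node anyOf(String... words) {
        return field -> field + " matches '.*(" + String.join("|", words) + ").*'";
    }

    // AND of child conditions, using Pig's boolean operator.
    static Node and(Node... children) {
        return join(" and ", children);
    }

    // OR of child conditions, using Pig's boolean operator.
    static Node or(Node... children) {
        return join(" or ", children);
    }

    // Joins the rendered children with the operator and wraps them in parentheses.
    private static Node join(String op, Node... children) {
        return field -> Arrays.stream(children)
                .map(c -> c.toPig(field))
                .collect(Collectors.joining(op, "(", ")"));
    }

    public static void main(String[] args) {
        Node criteria = or(
                and(anyOf("Foo", "Foo Bar", "FooBar"), anyOf("test", "testA", "TestB")),
                anyOf("TestZ"));
        System.out.println("F = filter T by " + criteria.toPig("str") + ";");
    }
}
```

Running `main` prints `F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*(TestZ).*');`, which is equivalent to the hand-written filter above apart from the extra group around TestZ.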
jkovacs
1

You can use this regex with the matches method:

^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|TestB|test)\\b)).*
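  • Note that "Foo" OR "Foo Bar" OR "FooBar" is written as FooBar|Foo Bar|Foo rather than Foo|Foo Bar|FooBar, so that the alternation does not stop at Foo in a string containing FooBar or Foo Bar. With the \\b anchors inside a look-ahead, this affects which alternative is matched, not whether the string as a whole matches.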
  • Also, since a look-ahead is zero-width, you need to append .* to the regex so that matches can consume the entire string.

Demo

String[] data = { "The quick brown Foo jumped over the lazy test",
        "the was something going on in TestZ",
        "the quick brown Foo jumped over the lazy dog" };
String regex = "^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|TestB|test)\\b)).*";
for (String s : data) {
    System.out.println(s.matches(regex) + " : " + s);
}

output:

true : The quick brown Foo jumped over the lazy test
true : the was something going on in TestZ
false : the quick brown Foo jumped over the lazy dog
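
The question also asks about generating such a regex programmatically. The shape of the regex above suggests a mechanical translation: each word becomes a look-ahead, AND becomes concatenation of look-aheads, and OR becomes alternation. Below is a minimal Java sketch of that idea. The class `LookaheadRegexBuilder`, the `Node` interface, and the helpers `word`, `and`, `or`, and `compile` are hypothetical names invented for this illustration. It uses `Pattern.quote` to escape the words, which is an addition not present in the hand-written regex.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookaheadRegexBuilder {
    // Hypothetical expression tree node: renders itself as a regex fragment.
    interface Node {
        String toRegex();
    }

    // Leaf: a zero-width look-ahead asserting the whole word or phrase occurs somewhere.
    static Node word(String w) {
        return () -> "(?=.*\\b" + Pattern.quote(w) + "\\b)";
    }

    // AND: consecutive look-aheads are all tested from the same position, so all must hold.
    static Node and(Node... children) {
        return () -> Arrays.stream(children).map(Node::toRegex)
                .collect(Collectors.joining("", "(?:", ")"));
    }

    // OR: alternation between the children inside a non-capturing group.
    static Node or(Node... children) {
        return () -> Arrays.stream(children).map(Node::toRegex)
                .collect(Collectors.joining("|", "(?:", ")"));
    }

    // Anchor at the start and append .* so String.matches can consume the whole input.
    static String compile(Node root) {
        return "^" + root.toRegex() + ".*";
    }

    public static void main(String[] args) {
        Node criteria = or(
                and(or(word("Foo"), word("Foo Bar"), word("FooBar")),
                    or(word("test"), word("testA"), word("TestB"))),
                word("TestZ"));
        String regex = compile(criteria);
        String[] data = { "The quick brown Foo jumped over the lazy test",
                "the was something going on in TestZ",
                "the quick brown Foo jumped over the lazy dog" };
        for (String s : data) {
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}
```

For the three sample sentences this prints true, true, and false, the same results as the hand-written regex. If the generated string is pasted into a Pig script rather than used from Java, the backslashes would presumably need the same doubling as in the regex shown at the top of this answer.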
Pshemo