
This is kind of an odd situation, but I'm looking for a way to filter in Pig using something like MATCHES, but against a list of patterns whose contents and number I don't know beforehand.

That is, suppose the input is two files, one with numbers, A:

xxxx

yyyy

zzzz

zzyy

...etc...

And the other with patterns B:

xx.*

yyy.*

...etc...

How can I filter the first input, by all of the patterns in the second?

If I knew all the patterns beforehand, I could write A = FILTER A BY (num MATCHES 'somepattern.*' OR num MATCHES 'someotherpattern' ...);

The problem is that I don't know them beforehand, and since they're patterns and not simple strings, I cannot just use joins/groups (at least as far as I can tell). Maybe a strange nested FOREACH...thing? Any ideas at all?

SubSevn

1 Answer


If you use |, which operates as an OR, you can construct a single pattern out of the individual patterns:

(xx.*|yyy.*|zzzz.*)

This checks whether the input matches any of the patterns.
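
As a quick illustration, here is a minimal Python sketch (not Pig) that runs the combined pattern above against the sample values from the question. It uses `re.fullmatch` on the assumption that the whole value must match, which is how I understand Pig's MATCHES to behave; the variable names are made up for the example.

```python
import re

# Combined pattern from above: each alternative is one of the original patterns.
combined = re.compile(r"(xx.*|yyy.*|zzzz.*)")

# Sample values taken from the question's first file.
values = ["xxxx", "yyyy", "zzzz", "zzyy"]

# fullmatch requires the entire string to match the pattern
# (assumption: this mirrors whole-string matching in the target system).
matched = [v for v in values if combined.fullmatch(v)]
print(matched)
```

Here "zzyy" is dropped because none of the three alternatives matches it.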

Edit: To create the combined regex pattern:
* Create a string starting with (
* Read in each line (assuming each line is a pattern) and append it to the string, followed by a |
* When done reading lines, remove the last character (which will be an unneeded trailing |)
* Append a )

This will create a regex pattern to check all the patterns in the input file. (Note: It's assumed the file contains valid patterns)
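
The steps above could be sketched in Python roughly as follows. This is only an illustration, not a Pig implementation: `build_combined_pattern` and the sample lists are hypothetical names, and it joins the patterns with | instead of appending one after each line and trimming the last, which gives the same result. Blank lines are skipped, and the patterns are still assumed to be valid.

```python
import re

def build_combined_pattern(lines):
    """Join one pattern per line into a single alternation, e.g. (a|b|c)."""
    # Strip newlines/whitespace and ignore blank lines.
    patterns = [line.strip() for line in lines if line.strip()]
    if not patterns:
        raise ValueError("no patterns supplied")
    # Joining with | avoids having to remove a trailing | afterwards.
    return "(" + "|".join(patterns) + ")"

# Hypothetical stand-ins for the contents of the two input files.
pattern_lines = ["xx.*\n", "\n", "yyy.*\n"]
numbers = ["xxxx", "yyyy", "zzzz", "zzyy"]

combined = build_combined_pattern(pattern_lines)
regex = re.compile(combined)

# Keep only the values that match at least one pattern in full.
kept = [n for n in numbers if regex.fullmatch(n)]
print(combined, kept)
```

In practice the lines would come from the pattern file (for example by iterating over an open file object) instead of a hard-coded list.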

QuinnG
  • How can that be done programmatically? As I said, I have no knowledge of how many of these patterns there are (or their contents, obviously) beforehand. – SubSevn Apr 18 '11 at 18:10
  • @SubSevn: Updated the post with logic – QuinnG Apr 18 '11 at 18:27
  • So there's no way to do this with straight Pig? – SubSevn Apr 18 '11 at 18:28
  • @SubSevn: I'm not familiar with Pig so I can't say how to implement the solution in that specifically. You might want to check http://stackoverflow.com/questions/3285082/regexp-matching-in-pig and see if that helps point you in a useful direction. – QuinnG Apr 18 '11 at 18:33
  • Ahhhh, well that's what the issue is: Pig appears to be fairly incomplete. Unfortunately, I'm somewhat constrained to that... – SubSevn Apr 18 '11 at 18:34
  • @SubSevn: I haven't used Pig or streaming at all, but looking into streaming might let you use Python or another language to do the filtering and then return to the Pig script. Just an idea; I haven't used those, I only have some familiarity with the concepts. – QuinnG Apr 18 '11 at 18:47
  • I plan on creating a UDF in Pig to do what you're saying, read in the inputs and then combine them into a single regex, then I can just use "blah = FILTER somestuff BY x MATCHES regex" and be done with it. Thanks! – SubSevn Apr 18 '11 at 19:14