Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?

Question

I'm using Amazon's Elastic MapReduce.

I have log files that look something like this:

   random text foo="1" more random text foo="2"
   more text notamatch="5" noise foo="1"
   blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ...

How can I write a Pig expression to pick out all the numbers in the 'foo' expressions?

I'd prefer tuples that look something like this:

(1,2)
(1)
(1,3,4)

I've tried the following:

TUPLES = foreach LINES generate FLATTEN(EXTRACT(line,'foo="([0-9]+)"'));

But this yields only the first match in each line:

(1)
(1)
(1)

score 0 · Answer 1 · answered Sep 04 '14 at 07:13


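The REGEX_EXTRACT function may help you get the output you want. Its third argument is the index of the capture group to return; the pattern here has a single group, so the index is 1. Note that each call returns one match, not every match on the line.

REGEX_EXTRACT(input, 'foo=(.*)', 1) AS input;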


Donald Miner · Answer 2 · 2010-12-30T16:09:49.567


You could use STRSPLIT: http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#STRSPLIT

The regex to split on would be [^0-9]+ (i.e., one or more non-digit characters). This splits the line on every run of non-digits, leaving only the tokens made of digits.
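To see what such a split produces on the sample lines, here is a small Python sketch. It uses `re.split` as a stand-in for STRSPLIT, which is an assumption: Pig may handle empty leading or trailing tokens differently, so the sketch simply filters empties out. The helper name `digit_tokens` is made up for illustration.

```python
import re

def digit_tokens(line):
    """Split on runs of non-digits and keep only the non-empty digit tokens."""
    # re.split leaves empty strings at the ends when the line starts or
    # ends with non-digits; drop them so only the numbers remain.
    return [tok for tok in re.split(r'[^0-9]+', line) if tok]

if __name__ == "__main__":
    print(digit_tokens('random text foo="1" more random text foo="2"'))
    print(digit_tokens('more text notamatch="5" noise foo="1"'))
```

The split keeps every number on the line, so the second sample line also yields the 5 from `notamatch="5"`, not just the `foo` values.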

Another option would be to write a Pig UDF.
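As a rough sketch of the logic such a UDF would need, the Python function below collects every `foo` value on a line using the pattern from the question. This is plain Python rather than a registered Pig UDF; the function name `extract_foo` is hypothetical, and the registration step and output schema needed to call it from Pig are left out.

```python
import re

# Same pattern as in the question: capture the digits inside foo="...".
FOO_PATTERN = re.compile(r'foo="([0-9]+)"')

def extract_foo(line):
    """Return every foo value on the line, in order, as a tuple of strings."""
    # findall returns the contents of the single capture group for each
    # match, so the result length follows the number of matches.
    return tuple(FOO_PATTERN.findall(line))

if __name__ == "__main__":
    sample_lines = [
        'random text foo="1" more random text foo="2"',
        'more text notamatch="5" noise foo="1"',
        'blah blah blah foo="1" blah blah foo="3" blah blah foo="4"',
    ]
    for line in sample_lines:
        print(extract_foo(line))
```

Because the pattern requires the literal `foo="` prefix, the `notamatch="5"` value on the second line is skipped, and each line yields a tuple of variable length.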

edited Dec 30 '10 at 16:09

answered Dec 30 '10 at 14:49

Donald Miner


Yes, you are right. Unfortunately, my example was misleading in that it didn't illustrate that I actually have other numeric expressions I don't want to match. I've updated my example to be more illustrative. – lmonson Dec 30 '10 at 16:51
Could you not use a Pig UDF to do this? – Donald Miner Dec 30 '10 at 16:53
