
If I have a bag like so:

 ({(11983070,39010451,1139539437),(11983070,53425518,11000)})

I want to select the whole tuple that has the MAX last value ($2), but I can only get the MAX value on its own for each of the bags.

I would like the output to be

{(11983070,39010451,1139539437)}

But I cannot get it to work. Any ideas?

Allan Macmillan

2 Answers


The idea would be to first find the MAX, then add the MAX value as an extra column and then filter out all rows which do not satisfy $2==$maxValue.

The following rough code is adapted from this solution:

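records = LOAD 'input.txt' AS (first:int, second:int, third:int);
-- Group everything into a single bag so MAX sees all rows
records_group = GROUP records ALL;
with_max = FOREACH records_group
       GENERATE
           FLATTEN(records.(first, second, third)) AS (first, second, third),
           MAX(records.third) AS max_third;
-- Keep only the rows whose third field equals the maximum
max_row = FILTER with_max BY third == max_third;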
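
The same three steps (find the maximum, attach it to every row, filter on equality) can be checked outside Pig on the sample bag from the question. This is only a plain-Python sketch, not Pig: the helper name `rows_with_max` is hypothetical, and tuples in a list stand in for the rows of the relation.

```python
def rows_with_max(rows, index=2):
    """Return every row whose field at `index` equals the maximum of that field."""
    if not rows:
        return []
    # Step 1: find the maximum of the chosen column
    max_value = max(row[index] for row in rows)
    # Step 2: attach the maximum to every row as an extra column
    with_max = [row + (max_value,) for row in rows]
    # Step 3: keep rows whose own value equals the attached maximum,
    # then drop the extra column again
    return [row[:-1] for row in with_max if row[index] == row[-1]]


bag = [(11983070, 39010451, 1139539437), (11983070, 53425518, 11000)]
print(rows_with_max(bag))  # [(11983070, 39010451, 1139539437)]
```

Because the filter compares on equality, every row that ties for the maximum is kept.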
Bharat Jain

Though you can do this in pure Pig, using a UDF should be more efficient. It is also pretty straightforward:

myudfs.py

#!/usr/bin/python

@outputschema('Values:{(first:int, second:int, third:int)}')
def get_max(BAG):
    v = max(BAG, key=lambda x: x[2])

    # Since you want it to return in a bag, v needs to be in a list
    return [v]

Pig Script

REGISTER 'myudfs.py' USING jython AS myudfs ;

-- A is your input
B = FOREACH A GENERATE myudfs.get_max(my_input_bag) ;
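
To try the UDF logic without a Pig installation, the function body can be exercised in plain Python. This sketch assumes the bag reaches the UDF as a list of tuples; the `outputschema` stand-in below is hypothetical and only replaces, for local testing, the decorator that Pig supplies when it loads the script.

```python
def outputschema(schema):
    """Local stand-in for the decorator Pig provides to Jython UDFs (testing only)."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputschema('Values:{(first:int, second:int, third:int)}')
def get_max(bag):
    # Pick the tuple with the largest third field and wrap it in a list,
    # so that it comes back as a bag
    return [max(bag, key=lambda t: t[2])]


sample = [(11983070, 39010451, 1139539437), (11983070, 53425518, 11000)]
print(get_max(sample))  # [(11983070, 39010451, 1139539437)]
```

If several tuples tie on the third field, `max` returns only the first of them, so the result always holds exactly one tuple.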
mr2ert