hadoop cascading how to get top N tuples

Question

I'm new to Cascading and trying to find a way to get the top N tuples based on a sort order. For example, I'd like to know the top 100 first names people are using.

Here's how I can do something similar in Teradata SQL:

select top 100 first_name, num_records   
from
    (select first_name, count(1) as num_records   
     from table_1  
     group by first_name) a  
order by num_records DESC

Here's something similar in Hadoop Pig:

a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;

It seems very easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!

Engineiro · Answer 1 · 2013-05-01T13:38:55.500

Assuming you just need to know how to set up the Pipe for this:

In Cascading 2.1.6,

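import cascading.operation.aggregator.Count;
import cascading.operation.aggregator.First;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
                                 new Fields("first_name"));

// Count the tuples in each first_name group; the result goes into "num_records"
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
                          new Count(new Fields("num_records")), Fields.ALL);

firstNamePipe = new GroupBy(firstNamePipe,
                            new Fields("first_name"),
                            new Fields("num_records"),
                            true); // true sorts in descending order

// Keep the first 100 tuples seen in each group
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
                          new First(Fields.ARGS, 100), Fields.ALL);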

Here InPipe is the pipe fed by your incoming tap, which holds the tuple data referenced above, namely "first_name". The "num_records" field is created when new Count() is called.

If you have the "num_records" and "first_name" data in separate taps (tables or files) then you can set up two pipes that point to those two Tap sources and join them using CoGroup.

The definitions I used are from Cascading 2.1.6:

GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)

Count(Fields fieldDeclaration)

First(Fields fieldDeclaration, int firstN)

Hi Engineiro, I think you are grouping on the "first_name" field and sorting on num_records within that group, i.e. sorting only among tuples that share the same first name. What I want here is the top first names overall: a kind of "group all", then take the top rows. — Kartrace, Apr 30 '13 at 21:36
What I can think of so far is to add a constant field to the {first_name, num_records} scheme and group on that constant field to get a single group, then sort on num_records and take the top N. — Kartrace, Apr 30 '13 at 21:39
You're right. I made some edits. Keep in mind, though, that this is all a local sort. Hadoop and Cascading in general are not well suited to total sorts. For a total sort in Cascading, you need a single reducer. — Engineiro, May 01 '13 at 13:40
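
The constant-field idea from the comments (put every {first_name, num_records} row into one group, sort by num_records descending, keep the first N) can be modelled in plain Java. This is only a sketch of the logic, not Cascading code: the class TopFirstNames and the method topN are hypothetical names, and an in-memory list stands in for the tuple stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the pipeline. TopFirstNames and topN are
// hypothetical names used for illustration, not Cascading API.
class TopFirstNames {

    // Returns up to n (first_name, num_records) entries, highest count first.
    static List<Map.Entry<String, Long>> topN(List<String> firstNames, int n) {
        // Step 1: group by first_name and count the records in each group.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String name : firstNames) {
            counts.merge(name, 1L, Long::sum);
        }

        // Step 2: treat all (name, count) rows as ONE group and sort by count,
        // descending. This is what the constant grouping field achieves.
        List<Map.Entry<String, Long>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));

        // Step 3: keep only the first n rows of that single sorted group.
        return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ann", "bob", "ann", "cat", "bob", "ann");
        for (Map.Entry<String, Long> row : topN(names, 2)) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

As with the single-reducer case mentioned above, everything is sorted in one place, so the sketch shows the semantics rather than a scalable implementation.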

Nagendra kumar · Answer 2 · 2014-03-19T08:20:45.923

Method 1

Use a GroupBy to group on the required columns and make use of the secondary sorting that Cascading provides. By default it sorts in ascending order; if you want descending order, set the reverseOrder flag on the GroupBy.

To get the top N tuples or rows:

It's quite simple. Use a static counter variable in a Filter, increment it by 1 for each tuple, and check whether it is greater than N.

Return true when the count is greater than N, otherwise return false.

This will produce output containing only the first N tuples.
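
The counting logic of Method 1 can be sketched in plain Java as follows. This is a model of the idea, not Cascading code: FirstNFilter and firstN are hypothetical names, the rows are assumed to arrive already sorted, and the counter is kept per instance instead of static. A static counter is only shared within one JVM, so this approach presumably gives a global "first N" only when all tuples pass through a single task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the counting filter described above.
// FirstNFilter is a hypothetical name, not a Cascading class.
class FirstNFilter {
    private final long limit;
    private long seen = 0; // per-instance counter (the answer uses a static one)

    FirstNFilter(long limit) {
        this.limit = limit;
    }

    // Same convention as the answer: true means "drop this tuple".
    boolean isRemove() {
        seen++;
        return seen > limit;
    }

    // Runs the filter over rows that are assumed to be sorted already.
    static <T> List<T> firstN(List<T> sortedRows, long n) {
        FirstNFilter filter = new FirstNFilter(n);
        List<T> kept = new ArrayList<>();
        for (T row : sortedRows) {
            if (!filter.isRemove()) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("ann", "bob", "cat", "dan");
        System.out.println(firstN(rows, 2));
    }
}
```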

Method 2

Cascading provides a built-in Unique sub-assembly; its documentation refers to FirstNBuffer, a buffer that returns the first N tuples seen in a grouping.

See the link below: http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html
