Speed up Hive or Pig aggregation by using pre-sorted data

Question

I want to speed up a simple Apache Hive (0.13.1) or Pig (version 0.12.0) aggregation job on Amazon EMR. My data is already sorted on the key that needs to be aggregated and I want the jobs to make use of that.

Hive:

[..some 'set' calls etc...]
CREATE EXTERNAL TABLE ngrams (gram string, year int, occurrences bigint, pages bigint, books bigint)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION 's3://mybucket/3gram/';
INSERT OVERWRITE TABLE s3 select gram, sum(occurrences) from ngrams where year >= 1910 group by gram;

For Hive I couldn't find a way to tell it that the data is already sorted.

Pig:

ngrams = LOAD 's3://mybucket/3gram/' AS (ngram:chararray, year:int, counter:int, pages:int);
filtered = FILTER ngrams BY year >= 1910;
grouped = GROUP filtered BY (ngram);
summed = FOREACH grouped GENERATE group, SUM(filtered.counter);

For Pig, I found that GROUP ... USING 'collected' is supposed to make use of the sorting, but I get:

While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc

So how can I load the data in a sorted way? I found examples with LOAD and USING org.apache.hadoop.zebra.pig.TableLoader() on the web, but Pig complains it doesn't know that class.

The pig example is almost complete already, only there's a `store summed into 's3://mybucket/3gram-pig-output';` at the end — Daniel Naber, Jun 28 '15 at 09:37
It is, except the loading phase where you LOAD data with zebra TableLoader. — glefait, Jun 28 '15 at 09:40
I tried `ngrams = LOAD 's3://mybucket/3gram/' USING org.apache.hadoop.zebra.pig.TableLoader('ngram, year, counter, pages', 'sorted');` (locally, not in AWS - when I try it in AWS the job fails but I didn't find a useful error message yet) — Daniel Naber, Jun 28 '15 at 10:04

score 0 · Answer 1 · answered Jun 28 '15 at 16:24

First, you need to REGISTER the Zebra jar if it is not already part of your Hadoop installation.

If you need to build the jar:

get the source of Pig 0.12.0
compile Pig
compile Zebra (`ant zebra`)

REGISTER '/yourpath/pig-0.12.0/build/contrib/zebra/zebra-0.8.0-dev.jar';

Second, as far as I know (and from what I tried), TableLoader cannot LOAD row-oriented data such as ordinary text files. The data must first be written with TableStorer, which stores it in a column-oriented format with the schema included.

You may try this and check the output/errors:

ngrams_row = LOAD 's3://mybucket/3gram/' AS (ngram: chararray, year:int, counter: int, pages: int);
STORE ngrams_row INTO 's3://mybucket/3gram-zebra/' using org.apache.hadoop.zebra.pig.TableStorer('[ngram];[year,counter,pages]');
ngrams_zebra = LOAD 's3://mybucket/3gram-zebra/' USING org.apache.hadoop.zebra.pig.TableLoader('ngram,year,counter,pages', 'sorted'); 

DESCRIBE ngrams_zebra;
DUMP ngrams_zebra;

Thanks, the conversion step works but then I get `java.io.IOException: The table is not sorted`. The data is sorted, maybe in a different way than zebra expects. Also, I see zebra is deprecated (https://issues.apache.org/jira/browse/PIG-3996) and the original idea was to speed up aggregation and I'm not sure that's possible if I have yet another step to run. — Daniel Naber, Jun 30 '15 at 08:10
It won't be a good idea if you only write and read the data once. However, if you have data that will be written once and read thousands of times, it can make sense. About the sorting issue, just try to sort by ngram before storing it. — glefait, Jun 30 '15 at 08:19
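
A minimal sketch of what the last comment suggests, combined with the aggregation from the question. It is untested and rests on several assumptions: that Zebra records the sort order when the relation is ordered with ORDER ... BY before TableStorer writes it, that TableLoader in 'sorted' mode satisfies the CollectableLoadFunc requirement from the error message, and that the output path `3gram-zebra-sorted` (a hypothetical name) is free to use. The jar path is the placeholder from the answer.

```pig
REGISTER '/yourpath/pig-0.12.0/build/contrib/zebra/zebra-0.8.0-dev.jar';

-- One-time conversion: load the text files, sort on the grouping key,
-- and write a Zebra table (assumption: Zebra keeps the sort information).
ngrams_row = LOAD 's3://mybucket/3gram/' AS (ngram:chararray, year:int, counter:int, pages:int);
ngrams_sorted = ORDER ngrams_row BY ngram;
STORE ngrams_sorted INTO 's3://mybucket/3gram-zebra-sorted/'
    USING org.apache.hadoop.zebra.pig.TableStorer('[ngram];[year,counter,pages]');

-- Repeated aggregation: read the sorted table and group map-side.
-- This part could live in a separate script that is run many times.
ngrams = LOAD 's3://mybucket/3gram-zebra-sorted/'
    USING org.apache.hadoop.zebra.pig.TableLoader('ngram,year,counter,pages', 'sorted');
filtered = FILTER ngrams BY year >= 1910;
grouped = GROUP filtered BY ngram USING 'collected';
summed = FOREACH grouped GENERATE group, SUM(filtered.counter);
STORE summed INTO 's3://mybucket/3gram-pig-output';
```

The ORDER step is itself a full sort, so this only pays off in the write-once, read-many case described in the comment above; for a single aggregation the plain GROUP from the question is likely cheaper.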
