I want to speed up a simple Apache Hive (0.13.1) or Pig (0.12.0) aggregation job on Amazon EMR. My data is already sorted on the key I am grouping by, and I want the jobs to take advantage of that.
Hive:
[..some 'set' calls etc...]
CREATE EXTERNAL TABLE ngrams (gram string, year int, occurrences bigint, pages bigint, books bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/3gram/';
INSERT OVERWRITE TABLE s3 SELECT gram, SUM(occurrences) FROM ngrams WHERE year >= 1910 GROUP BY gram;
For Hive I couldn't find a way to tell it that the data is already sorted.
Pig:
ngrams = LOAD 's3://mybucket/3gram/' AS (ngram:chararray, year:int, counter:int, pages:int);
filtered = FILTER ngrams BY year >= 1910;
grouped = GROUP filtered BY (ngram);
summed = FOREACH grouped GENERATE group, SUM(filtered.counter);
For Pig, I found that GROUP ... USING 'collected' is supposed to make use of the sorting, but I get:
While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc
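For reference, a minimal sketch of the variant that triggers this error, assuming the rest of the script above is unchanged and only the GROUP statement gets the USING clause:

```pig
-- Same LOAD and FILTER statements as in the script above.
-- 'collected' asks Pig to group on the map side, which relies on all
-- records for a given key being contiguous in the input.
grouped = GROUP filtered BY (ngram) USING 'collected';
summed = FOREACH grouped GENERATE group, SUM(filtered.counter);
```

The LOAD statement in that script has no USING clause, so Pig falls back to its default loader, which appears to be what the message is objecting to.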
So how can I load the data in a sorted way? I found examples on the web that use LOAD with USING org.apache.hadoop.zebra.pig.TableLoader(), but Pig complains it doesn't know that class.