
I'm trying to write a Pig Latin script that returns the number of records in a dataset I've filtered.

Here's the script so far:

/* scans by title */

scans           = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans    = FILTER scans BY (title MATCHES 'proactiv');
scancount       = FOREACH productscans GENERATE COUNT($0);
DUMP scancount;

For some reason, I get the error:

Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast.

What am I doing wrong here? I'm assuming it has something to do with the type of the field I'm passing in, but I can't seem to resolve this.

TIA, Jason

JasonA

3 Answers


Is this what you're looking for? Use GROUP ... ALL to bring everything into one bag, then count the items in that bag:

scans           = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans    = FILTER scans BY (title MATCHES 'proactiv');
grouped         = GROUP productscans ALL;
count           = FOREACH grouped GENERATE COUNT(productscans);
dump count;
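
To see why the grouping step matters, it can help to inspect the schema of the grouped relation. The following is a minimal sketch that reuses the load path and schema from the question; nothing new is assumed beyond adding DESCRIBE. COUNT takes a bag as its argument, and GROUP ... ALL is what produces one.

```pig
-- Same load and filter as in the question.
scans        = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');

-- GROUP ... ALL collapses the relation into a single tuple.
grouped      = GROUP productscans ALL;

-- The schema should show two fields: `group`, and a bag named
-- `productscans` holding every filtered tuple. COUNT operates on that bag.
DESCRIBE grouped;
```

In the original script, `$0` inside `FOREACH productscans` refers to the `thetime` field, a long rather than a bag, which is why Pig could not match a COUNT implementation.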
Chris White

COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.

You can use either of the following:

scans           = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans    = FILTER scans BY (title MATCHES 'proactiv');
grouped         = GROUP productscans ALL;
scancount       = FOREACH grouped GENERATE COUNT(productscans);
DUMP scancount;

Or

scans           = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans    = FILTER scans BY (title MATCHES 'proactiv');
grouped         = GROUP productscans ALL;
scancount       = FOREACH grouped GENERATE COUNT($1);
DUMP scancount;
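
For the per-group case mentioned at the top of this answer, a sketch might look like the following. It assumes you want one count per `category`; the aliases `bycategory`, `categorycounts`, and `n` are made up for illustration.

```pig
scans          = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans   = FILTER scans BY (title MATCHES 'proactiv');

-- GROUP ... BY yields one tuple per distinct category, each holding
-- a bag of the matching productscans tuples.
bycategory     = GROUP productscans BY category;

-- `group` holds the grouping key; COUNT runs once per bag.
categorycounts = FOREACH bycategory GENERATE group AS category, COUNT(productscans) AS n;
DUMP categorycounts;
```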
Sanjiv

Maybe

/* scans by title */

scans           = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans    = FILTER scans BY (title MATCHES 'proactiv');
scancount       = FOREACH productscans GENERATE COUNT(productscans);
DUMP scancount;
whoisjake
    thanks Jake - unfortunately, no luck. that gives me: `Invalid scalar projection: productscans : A column needs to be projected from a relation for it to be used as a scalar` – JasonA Mar 22 '12 at 20:25