I have the following Pig code, in which I check two fields of each record in the main files (srcgt and destgt) against the values listed in another file (intlgt.txt), which contains 338, 918299, 181 and 238. It throws the error shown below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?

Pig code:

record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ; 
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;

Error is:

WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
LiMuBei
Amit
  • Please do attach the sample input file and the desired output. – hello_abhishek Mar 02 '16 at 06:31
  • Taking the error message as a hint, I'd say the problem is in the `filter` statement. More specifically in the `STARTSWITH` part. I guess it should be `intlgt.intlgt` instead of `intlgt::intlgt`. But even then I don't think this is gonna work the way you want it to as you're trying to filter by a field from a different relation. – LiMuBei Mar 02 '16 at 07:50

1 Answer

When you are using

intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );

Pig expects a scalar here: a single value, whether a number or a chararray. Because intlgt is a separate relation and not a field of cdrfilter, Pig treats intlgt::intlgt as a scalar and assumes the relation has exactly one row, e.g. the result of

intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0);

(This generates a single row holding the number of records in the original relation.)

In your case, intlgt contains more than one row, because you have not grouped it, and that is what triggers the "Scalar has more than one row" error. Based on your code, you are trying to find SMS messages that have an intlgt value as a prefix of the GT on either end. Possible solutions:

  1. If your intlgt entries all have the same length (e.g. 3), generate SUBSTRING(srcgt, 0, 3) as srcgtshort (SUBSTRING is zero-based and the end index is exclusive), then JOIN cdrfilter BY srcgtshort with intlgt BY intlgt. This gives you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.

  2. If the entries have a small number of distinct lengths (e.g. some have length 3, some 4, and some 5), you can do the same thing, but it is more laborious, because you need one substring field and one join per length.

  3. If the number of rows in the two relations is not too big, do a CROSS between them, which creates every possible combination of a row from cdrfilter and a row from intlgt. You can then filter by STARTSWITH(srcgt, intlgt::intlgt), because both are now fields in the same relation. Beware of this approach: the output has (rows in cdrfilter) x (rows in intlgt) records, which can get HUGE!
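The options above could be sketched roughly as follows. This is only an illustration, not tested against your data: it assumes cdrfilter is defined as in the question and that intlgt.txt holds one prefix per line. The aliases prefixes, pairs, matched, projected, withkey and bysrc are names made up for this example.

```pig
-- Sketch only: assumes cdrfilter is defined as in the question and that
-- intlgt.txt holds one prefix per line. All new alias names are made up.
prefixes = LOAD '/u02/config/intlgt.txt' AS (prefix:chararray);

-- Option 3: CROSS pairs every CDR with every prefix, so both values
-- end up as fields of the same relation and FILTER can compare them.
pairs = CROSS cdrfilter, prefixes;
matched = FILTER pairs BY STARTSWITH(cdrfilter::srcgt, prefixes::prefix)
                       OR STARTSWITH(cdrfilter::destgt, prefixes::prefix);

-- Keep only the nine CDR fields ($0 .. $8) and drop the repeats that
-- appear when one CDR matches several prefixes.
projected = FOREACH matched GENERATE $0 .. $8;
intlcdrs = DISTINCT projected;

-- Option 1, for comparison: valid only if every prefix has exactly
-- 3 characters. SUBSTRING is zero-based with an exclusive end index.
withkey = FOREACH cdrfilter GENERATE *, SUBSTRING(srcgt, 0, 3) AS srcgtshort;
bysrc = JOIN withkey BY srcgtshort, prefixes BY prefix;
```

Note that the sample values in the question have different lengths (338 has 3 digits, 918299 has 6), so the fixed-length join would not work as-is for that file. Also be aware that DISTINCT collapses CDRs that are genuinely identical in all nine fields, which may or may not be what you want.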

Ran Locar
  • Hi ran, can you please look into one of my post http://stackoverflow.com/questions/37119870/unable-to-pass-pig-tuple-to-python-udf – Amit May 10 '16 at 04:54