Unable to pass pig tuple to python UDF

Question

I have a file, master.txt, with 10K records. Each line should become a tuple, and the whole set of tuples needs to be passed to a Python UDF. Because the relation has multiple records, storing p2preportmap fails with the error below. Please help.

The error is as follows:

Unable to open iterator for alias p2preportmap. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (010301,MTS,MM), 2nd :(010B06,MTS,TN) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

The Pig script is as follows:

REGISTER 'smsiuc_udf.py' using streaming_python as smsiuc_udfs;
cdrs = load '2016040111*' USING PigStorage('|','-tagFile') ;

mastergtrec = load 'master.txt' USING PigStorage(',','-tagFile');

mastergt = FOREACH mastergtrec GENERATE (chararray) UPPER($1) as opcdpc, (chararray) UPPER($2) as gtoptname,(chararray) UPPER($3) as gtoptcircle;

mastergttup = FOREACH mastergt generate TOTUPLE(opcdpc,gtoptname,gtoptcircle) as mstgttup;

cdrrecord = FOREACH cdrs GENERATE (chararray) UPPER($1) as aparty, (chararray) UPPER($2) as bparty,$3 as smssentdate,$4 as smssenttime,($29=='6' ? 'S' : 'F') as status,(chararray) UPPER($26) as srcgt,(chararray) UPPER($27) as destgt,($12=='405899136999995' ? 'MTSDEL-CDMA' : ($12=='919875089998' ? 'MTSRAJ-GSM' : ($12=='405899150999995' ? 'MTSCHN-CDMA' : $12) ) ) as smscgt, (chararray)$0 as cdrfname,(chararray) $13 as prepost;

filteredp2pcdrs = FILTER cdrrecord by smsiuc_udfs.pullp2pcdrs(aparty,bparty,srcgt,destgt) and status == 'S' and SUBSTRING(smssentdate,4,6) == '$MON';

groupp2pcdrs = GROUP filteredp2pcdrs by (srcgt,destgt,aparty,bparty,smscgt,status,prepost);

distinctp2pcdrs= FOREACH groupp2pcdrs {
uniq = DISTINCT filteredp2pcdrs.(srcgt,destgt,aparty,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(uniq) as cnt;
};

p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty,mastergttup ),smscgt,status,prepost,cnt;

score 2 · Answer 1 · answered May 16 '16 at 07:57


Let me give you an example. I have two relations, A and B.

A

1,2,3
3,4,5
4,5,6

B

Now I want a Python UDF that looks up each value of B in the first column of A and prints output something like this:

((1,{(1,2,3)}))
((2,))
((3,{(3,4,5)}))
((1,{(1,2,3)}))
((2,))
((3,{(3,4,5)}))
((1,{(1,2,3)}))
((2,))
((3,{(3,4,5)}))

So first I group A by its first column, and then group the result by the constant 1 so that I have a single row:

c = group A by $0;
e = group c by 1;

The Python UDF looks something like this:

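def pythonudf(value, lookup):
    # lookup is the bag e.$1: one (key, bag_of_rows) tuple per distinct key in A
    print(lookup)  # debug output
    temp = None
    for row in lookup:
        # row[0] is the group key, row[1] is the bag of A rows with that key
        if row[0] == value:
            temp = row[1]
    return value, temp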

Now use this UDF:

D = foreach B generate myudf.pythonudf($0,e.$1);
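
To make the data shapes concrete, here is a minimal plain-Python sketch of the same idea, run outside Pig. It assumes a bag reaches the UDF as a list of tuples, uses integers instead of the untyped fields Pig would load, and invents the contents of B (the keys 1, 2, 3), since B is not shown above. The helper group_by_first is hypothetical and only imitates what the two GROUP statements produce.

```python
def pythonudf(value, lookup):
    # Same lookup logic as the UDF above, without the debug print.
    temp = None
    for row in lookup:
        if row[0] == value:
            temp = row[1]
    return value, temp


def group_by_first(rows):
    # Imitates "c = group A by $0": one (key, bag_of_rows) pair per key.
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return sorted(groups.items())


A = [(1, 2, 3), (3, 4, 5), (4, 5, 6)]

c = group_by_first(A)   # one row per distinct first column
e = [(1, c)]            # imitates "e = group c by 1": a single row
lookup_bag = e[0][1]    # imitates the scalar projection e.$1

B_keys = [1, 2, 3]      # assumption: B's real contents are not shown
results = [pythonudf(k, lookup_bag) for k in B_keys]

for r in results:
    print(r)
```

The key 2 has no match in A, so its second field stays None, which corresponds to the empty field in the ((2,)) rows of the output above.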

answered May 16 '16 at 07:57

Vikas Madhusudana


Thanks for such a wonderful explanation. I have been trying to do the same, but it is still not working. When I dump the group e, I get this: `1 {(master_gt_spc.txt,{(master.txt,9145,MTS,UPW),(master.txt,919225,MTS,WB),(master.txt,0101A0,MTS,TN),(master_gt_spc.txt,03F0,MTS,KL),(master.txt,YK,MTSI,KO),,(master_gt_spc.txt,YD-INDRLY,MTS,DL)})}`. When I print `a[0]` it gives `master.txt`; `a[1]` and `a[2]` don't print in the UDF. The Python UDF is `def map(srcgt,mastergttup): for tuplevalue in mastergttup: if tuplevalue[0] == srcgt: return tuplevalue[1]`. – Amit May 17 '16 at 10:57
Output is not getting mapped as required. – Amit May 17 '16 at 10:58
Looks like something is wrong with your relation e. Why does it contain master_gt_spc.txt and master.txt? It should hold srcgt values, right? Can you print c before grouping? Something went wrong while grouping. – Vikas Madhusudana May 17 '16 at 11:14
Yes, it is actually `1 {(master.txt,{(master.txt,9145,MTS,UPW),(master.txt,919225,MTS,WB),(master.txt,0101A0,MTS,TN),(master_gt_spc.txt,03F0,MTS,KL),(master.txt,YK,MTSI,KO),(master_gt_spc.txt,YD-INDRLY,MTS,DL)})}` and c prints as `master.txt {(master.txt,{(master.txt,9145,MTS,UPW),(master.txt,919225,MTS,WB),(master.txt,0101A0,MTS,TN),(master_gt_spc.txt,03F0,MTS,KL),(master.txt,YK,MTSI,KO),(master_gt_spc.txt,YD-INDRLY,MTS,DL)})}`. I think we should group all and then ??? – Amit May 17 '16 at 11:23
Can you print A? It looks like the way you are building c is wrong. – Vikas Madhusudana May 17 '16 at 11:30
I think we should group all, and then c would be `all {(9145,MTS,UPW),(919225,MTS,WB),(0101A0,MTS,TN),(03F0,MTS,KL),(YK,MTSI,KO),(YD-INDRLY,MTS,DL)})}`, but this is also not working. – Amit May 17 '16 at 11:35
Let's say you want a map to be passed to the Python UDF: you create a group-by column, then group by 1, and pass the result to the UDF. If you want multiple maps, you create them separately, group each by 1, and pass them all to the UDF, something like pythonudf(row,map1,map2...). I think you were trying to merge all the maps into a single map. – Vikas Madhusudana May 17 '16 at 11:41
If you provide sample data, I can give it a try. – Vikas Madhusudana May 17 '16 at 15:22
Sent to your Gmail ID. – Amit May 18 '16 at 03:28
Vikas, did you by any chance crack the issue? – Amit May 19 '16 at 05:37
I sent a reply to your email. I was able to group and run the test, and I also sent you a sample of the output. – Vikas Madhusudana May 19 '16 at 05:39
In case you can help with one more issue, posted at http://stackoverflow.com/questions/37315344/pig-pivoting-sum-3-relations, I will be obliged. – Amit May 19 '16 at 11:29

Vikas Madhusudana · Accepted Answer · 2016-05-12T10:58:14.453

1

This can be done by adding a dummy column and then grouping.

dummy = foreach p2preportmap generate 1, $0,$1 ....

grouped = group dummy by $0;
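
As a rough illustration of why this helps, the following plain-Python toy models the rule behind the "Scalar has more than one row" error: a relation can only be used as a scalar when it has exactly one row, and grouping by a constant collapses everything into one row holding a bag. The functions as_scalar and group_by_constant are hypothetical stand-ins written for this sketch, not Pig APIs, and the two sample rows are the ones quoted in the error message.

```python
def as_scalar(relation):
    # Toy stand-in for Pig's scalar projection: more than one row is an error.
    if len(relation) > 1:
        raise ValueError(
            "Scalar has more than one row in the output. 1st : %r, 2nd : %r"
            % (relation[0], relation[1])
        )
    return relation[0] if relation else None


def group_by_constant(relation, key=1):
    # Toy stand-in for "group rel by 1": one row of (key, bag_of_all_rows).
    return [(key, list(relation))]


mastergt = [("010301", "MTS", "MM"), ("010B06", "MTS", "TN")]

# Using the ungrouped relation as a scalar fails, as in the question.
try:
    as_scalar(mastergt)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# After grouping by a constant there is exactly one row, so it works.
grouped = group_by_constant(mastergt)
row = as_scalar(grouped)
print(row)
```

In this model the second field of the single row (row[1]) is the bag with every master record, which is what the UDF needs to receive.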

edited May 12 '16 at 10:58

answered May 10 '16 at 05:54

Vikas Madhusudana


The problem here is that you have multiple records in your relation p2preportmap. You should combine them into one; this can be done with the command above, or more simply with grouped = group p2preportmap by 1; now you can pass grouped into the UDF. – Vikas Madhusudana May 10 '16 at 07:17
So you mean that I should group mastergt, like dummy= foreach mastergtrec generate 1,opcdpc,gtoptname,gtoptcircle and then grouped = group dummy by opcdpc? – Amit May 10 '16 at 10:00
No, group dummy by $0 so that you get a single row. – Vikas Madhusudana May 10 '16 at 10:42
But this gives the error Scalar has more than one row in the output. 1st : (ALFA), 2nd :(AMIT) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar). Can you please suggest something? – Amit May 12 '16 at 08:27
Can you dump dummy and see if it has a single row? If you group by 1, it should have a single row. – Vikas Madhusudana May 12 '16 at 10:54
Got it. Now it is working: `p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty,mastergttup.group),smscgt,status,prepost,cnt`. But now the problem is that inside the Python UDF I am unable to map it properly. The Python UDF is ` for counter,value in enumerate(mastergt): splitline= value.strip().split(",") optnamecircle[splitline[0]]=",".join(splitline[1:]) mastergtcircle[splitline[0]]=",".join(splitline[2:])` – Amit May 12 '16 at 11:00
The output without mapping comes out with the first 3 characters of srcgt missing. `((CC_SELFCARE1,919810051023,,155,),MTSRAJ-GSM,S,postpaid,3) ((CC_SELFCARE1,919810051025,,155,),MTSRAJ-GSM,S,postpaid,2) ((CC_SELFCARE1,919825691400,,155,),MTSRAJ-GSM,S,postpaid,35) ((CC_SELFCARE1,919825691400,,155,),MTSRAJ-GSM,S,postpaid,1)`. The rest is fine otherwise; I am sure I am not iterating over it properly. – Amit May 12 '16 at 11:02
So you are losing some characters of srcgt inside the Python UDF? – Vikas Madhusudana May 13 '16 at 03:12
Yes, and the mapping is also not done. Can you please advise on this? – Amit May 13 '16 at 07:14
I am not following. Can you explain what you are looking for and where you are stuck? – Vikas Madhusudana May 13 '16 at 15:38
I have a file whose contents are like `010301,MTS,MM 010B06,MTS,TN 011407,MTS,MH 027406,MTS,AP 027807,MTS,AP 027900,MTS,AP 027A05,MTS,AS 027D01,MTS,BH 028103,MTS,MP 028800,MTS,GJ`, which is grouped (mastergttup.group) and then passed to the function as you suggested. On passing it using `p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty,mastergttup.group),smscgt,status,prepost,cnt` I get the correct values in the UDF, but I am unable to map the 1st column of mastergttup to the srcgt in the Python UDF. I am iterating using the code above. – Amit May 16 '16 at 06:38
mastergttup.group will be 1. You should pass the entire row, that is mastergttup, and then access $1. – Vikas Madhusudana May 16 '16 at 07:14
Yes, my understanding was along those lines too, but it gives `A column needs to be projected from a relation for it to be used as a scalar`, so I passed it like this. – Amit May 16 '16 at 07:46
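
For the mapping step discussed in these comments, here is a hedged plain-Python sketch of building the two lookup dictionaries from the master bag. It assumes the bag arrives in the UDF as a list of (opcdpc, gtoptname, gtoptcircle) tuples of strings rather than as comma-separated lines, so the fields are indexed directly instead of being split. The function names build_lookups and lookup_operator are hypothetical, the sample rows are taken from the file contents quoted above, and this has not been run against the real data from the thread.

```python
def build_lookups(master_bag):
    # Index the master rows by their first field (opcdpc in the script).
    optnamecircle = {}
    mastergtcircle = {}
    for row in master_bag:
        key = row[0]
        optnamecircle[key] = ",".join(row[1:])   # "name,circle"
        mastergtcircle[key] = ",".join(row[2:])  # "circle"
    return optnamecircle, mastergtcircle


def lookup_operator(srcgt, master_bag):
    # Returns "name,circle" for a matching srcgt, or None when there is no match.
    optnamecircle, _ = build_lookups(master_bag)
    return optnamecircle.get(srcgt)


master_bag = [
    ("010301", "MTS", "MM"),
    ("010B06", "MTS", "TN"),
    ("011407", "MTS", "MH"),
]

names, circles = build_lookups(master_bag)
print(names["010301"])
print(circles["010B06"])
print(lookup_operator("011407", master_bag))
```

Rebuilding the dictionaries on every call is wasteful for 10K master rows; in a real UDF one would probably cache them after the first call, which is left out here to keep the sketch short.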
