
Below is the Pig (0.15) script that maps the input file (alias cdrs) against another file (alias mastergt). It calls a Python (2.7.11) UDF to do the mapping, and that takes about 40 minutes for roughly 4.5K records. Can you please suggest improvements?

Pig Script:

REGISTER 'smsiuc_udf.py' using streaming_python as smsiuc_udfs;

cdrs = load '2016040111*' USING PigStorage('|','-tagFile') ;

mastergtrec = load 'master.txt' USING PigStorage(',','-tagFile');

mastergt = FOREACH mastergtrec GENERATE (chararray) UPPER($1) as opcdpc, (chararray) UPPER($2) as gtoptname,(chararray) UPPER($3) as gtoptcircle;

cdrrecord = FOREACH cdrs GENERATE (chararray) UPPER($1) as aparty, (chararray) UPPER($2) as bparty,$3 as smssentdate,$4 as smssenttime,($29=='6' ? 'S' : 'F') as status,(chararray) UPPER($26) as srcgt,(chararray) UPPER($27) as destgt,($12=='405899136999995' ? 'MTSDEL-CDMA' : ($12=='919875089998' ? 'MTSRAJ-GSM' : ($12=='405899150999995' ? 'MTSCHN-CDMA' : $12) ) ) as smscgt, (chararray)$0 as cdrfname,(chararray) $13 as prepost;

filteredp2pcdrs = FILTER cdrrecord by smsiuc_udfs.pullp2pcdrs(aparty,bparty,srcgt,destgt) and status == 'S' and SUBSTRING(smssentdate,4,6) == '$MON';

groupp2pcdrs = GROUP filteredp2pcdrs by (srcgt,destgt,aparty,bparty,smscgt,status,prepost);

distinctp2pcdrs= FOREACH groupp2pcdrs {
    uniq = DISTINCT filteredp2pcdrs.(srcgt,destgt,aparty,bparty,smscgt,status,prepost);
    GENERATE FLATTEN(group),COUNT(uniq) as cnt;
    };

p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty),smscgt,status,prepost,cnt;

Python UDF is as follows:

    def p2preport(srcgt,destgt,aparty,bparty):
        mastergt = {}
        masterlrn = {}
        origno = str(int(aparty))
        destno = str(int(bparty))
        returnstring = []
        try:
            if ((os.path.isfile(MASTERLRN) and os.access(MASTERLRN, os.R_OK) and os.stat(MASTERLRN).st_size > 0) and (os.path.isfile(MASTERGT) and os.access(MASTERGT, os.R_OK) and os.stat(MASTERGT).st_size > 0)):

                #READ CONTENTS OF MASTER GT/LRN IN BAG/DICT
                mastergt = readfileinbag(MASTERGT,1)
                masterlrn = readfileinbag(MASTERLRN,2)
                mastergtcircle = readfileinbag(MASTERGT,2)

                if(srcgt in mastergt):
                    returnstring = mastergt[srcgt]
                elif(srcgt[0:9] in mastergt):
                    returnstring = mastergt[srcgt[0:9]]
                elif(srcgt[0:8] in mastergt):
                    returnstring = mastergt[srcgt[0:8]]
                elif(srcgt[0:7] in mastergt):
                    returnstring = mastergt[srcgt[0:7]]
                elif(srcgt[0:6] in mastergt):
                    returnstring = mastergt[srcgt[0:6]]
                elif(srcgt[0:5] in mastergt):
                    returnstring = mastergt[srcgt[0:5]]
                elif(srcgt[0:4] in mastergt):
                    returnstring = mastergt[srcgt[0:4]]
                else:
                    returnstring = mastergt.get(srcgt,srcgt+",")

                if destgt in mastergt:
                    returnstring = returnstring + "," + mastergt[destgt]
                elif(destgt[0:9] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:9]]
                elif(destgt[0:8] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:8]]
                elif(destgt[0:7] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:7]]
                elif(destgt[0:6] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:6]]
                elif(destgt[0:5] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:5]]
                elif(destgt[0:4] in mastergt):
                    returnstring = returnstring + "," + mastergt[destgt[0:4]]
                else:
                    returnstring = returnstring + mastergt.get(destgt,destgt+",")

            return returnstring

        except AttributeError:
            pass
Amit
  • What is taking 40 mins, the Pig script or the Python UDF? – Vikas Madhusudana May 09 '16 at 10:35
  • The Python UDF is taking 40 mins. Please help. – Amit May 09 '16 at 11:40
  • Is it OK to share the sample data? Also, you can try changing the order of the elif branches based on the density of the data. For example, if elif(srcgt[0:4] in mastergt): hits most often, make it the first condition so you avoid the other slicing and comparisons. – Vikas Madhusudana May 10 '16 at 03:03
  • 917732978625|9132018104293250|20160401|105944|null|null|null|0|null|0|2|405899136999995|postpaid|0|4294967040|null|null|null|null|919891030059|null|null|0|458|null|919891030059|161A|00000000|6|20160401105944. This is sample cdr. Can you please look into this and suggest. – Amit May 10 '16 at 04:12
  • Here is the list of problems I see for each row of input to the UDF: 1) You do some checks on the mastergt file. 2) You load the file (3 files). 3) You slice destgt and srcgt to compare them against the mastergt dict. All of the above are costly. What you can do is load the dictionary as a JSON file in Pig and then pass it to the UDF; this will reduce the file I/O. It is hard to suggest more without knowing the functionality. – Vikas Madhusudana May 10 '16 at 05:01
  • What values do srcmgt and destmgt take? – Vikas Madhusudana May 10 '16 at 05:09
  • Thanks Vikas, can you please throw some light on: 1) how to load the dictionary as a JSON file in Pig, and 2) how to pass the dictionary to the UDF? I am currently facing challenges with this and have raised a separate post on Stack Overflow (http://stackoverflow.com/questions/37119870/unable-to-pass-pig-tuple-to-python-udf). As for your query, srcmgt & destmgt take the values in col1 of master.txt. Following is the content of master.txt: 010301,MTS,MM 010B06,MTS,TN 011407,MTS,MH 027406,MTS,AP 027807,MTS,AP 027900,MTS,AP 027A05,MTS,AS 027D01,MTS,BH – Amit May 10 '16 at 05:26
  • There is a JsonLoader in Pig, but you have to pass a schema to load a JSON file. – Vikas Madhusudana May 10 '16 at 05:36
  • Thanks Vikas. On the functionality: in distinctp2pcdrs we have to replace srcgt and destgt with the second and third columns of the master file whenever they match its first column. Any suggestion on this would be highly appreciated. – Amit May 10 '16 at 06:31
  • So you mean that if your srctgt is 919891030059, it tries to match against the first column of mastertgt and then tries to return the other columns? Does master.txt change dynamically, or does it have fixed values? – Vikas Madhusudana May 10 '16 at 07:24
  • Why are you loading masterlrn = readfileinbag(MASTERLRN,2) and mastergtcircle = readfileinbag(MASTERGT,2)? I see you are not using them in the function. Can you remove these two lines? – Vikas Madhusudana May 10 '16 at 07:26
  • master.txt has fixed values. MASTERLRN is loaded for some other purpose, in the same context we are discussing. – Amit May 10 '16 at 08:31
  • If master.txt has fixed values then you can probably hardcode the dictionary, something like mastergt = { 010B06:(MTS,TN), ...}, in your UDF rather than loading it from the file each time. Also, why do you need to slice srctgt if the mastergt index is of fixed length? – Vikas Madhusudana May 10 '16 at 08:35
  • Thanks will try this.... – Amit May 10 '16 at 08:50
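
To make the suggestions in the comments above concrete, here is a minimal Python sketch of the two ideas: read the master file once per UDF process instead of once per row, and replace the elif chain with a loop that tries the longest prefix first. The names load_master and longest_prefix are hypothetical and not part of the original UDF. It assumes master.txt is comma-separated with the lookup key in the first column, as in the sample posted above, and it does not reproduce whatever readfileinbag actually returns.

```python
import csv

# Module-level cache: a streaming UDF process handles many rows,
# so each master file only needs to be read once per process.
_MASTER_CACHE = {}


def load_master(path, value_col=1):
    """Hypothetical helper: read a comma-separated master file into a dict.

    Assumes the lookup key is in column 0 and the wanted value is in
    column `value_col`. Later calls with the same arguments reuse the
    cached dict instead of touching the file again.
    """
    cache_key = (path, value_col)
    if cache_key not in _MASTER_CACHE:
        table = {}
        with open(path) as handle:
            for row in csv.reader(handle):
                if len(row) > value_col:
                    table[row[0].strip().upper()] = row[value_col].strip()
        _MASTER_CACHE[cache_key] = table
    return _MASTER_CACHE[cache_key]


def longest_prefix(table, gt, min_len=4, max_len=9):
    """Hypothetical helper: exact match first, then the longest prefix.

    Tries gt itself, then gt[:9], gt[:8], ... down to gt[:4], which is
    the same order as the elif chain in the question. Returns None when
    nothing matches, leaving the fallback to the caller.
    """
    if gt in table:
        return table[gt]
    for size in range(min(max_len, len(gt)), min_len - 1, -1):
        hit = table.get(gt[:size])
        if hit is not None:
            return hit
    return None
```

Inside p2preport, the two lookups would then become something like longest_prefix(load_master(MASTERGT), srcgt) and the same for destgt, with the existing fallback applied when the result is None. Whether this alone brings the run time down depends on how much of the 40 minutes is file I/O, which the thread does not measure.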
