The first file contains the following:

cl_id   date         TM        c_id      c_val
10201   2015-4-15  01:00:00  56707065  0
10201   2015-4-15  01:00:00  56707066  1
10201   2015-4-15  01:00:00  56707067  200

Likewise, there are multiple cl_id values, and for each cl_id the c_id and c_val are different.
Similarly, the second file contains:

cl_id   dt         tm        c_id      c_val
10201   2015-4-15  01:00:00  56707065  300
10201   2015-4-15  01:00:00  56707066  60
10201   2015-4-15  01:00:00  56707067  20

All the values are the same in file one and file two; only the counter value (c_val) changes for each c_id. I want a third file that contains the sum of c_val. For example, for cl_id 10201 and c_id 56707065, the two rows should combine into 10201 2015-4-15 01:00:00 56707065 with c_val 0 + 300 = 300, so the output in the third file will be:

10201   2015-4-15  01:00:00  56707065 300

Similarly, aggregate the results for c_id 56707066 and 56707067 and put them into the third file. Please suggest a Pig script that does this.

Kishore
Deepak Patil

1 Answer

Dump A;
cl_id   date         TM        c_id      c_val
10201   2015-4-15  01:00:00  56707065  0
10201   2015-4-15  01:00:00  56707066  1
10201   2015-4-15  01:00:00  56707067  200

Dump B;
cl_id   dt         tm        c_id      c_val
10201   2015-4-15  01:00:00  56707065  300
10201   2015-4-15  01:00:00  56707066  60
10201   2015-4-15  01:00:00  56707067  20

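-- Match rows from both files on the same cl_id and c_id.
C = JOIN A BY (cl_id, c_id), B BY (cl_id, c_id);

-- C has 10 fields: $0-$4 come from A and $5-$9 from B, so $4 + $9 adds the two c_val columns.
D = FOREACH C GENERATE $0, $1, $2, $3, $4 + $9;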

Dump D;
(10201,2015-4-15,01:00:00,56707065,300)
(10201,2015-4-15,01:00:00,56707066,61)
(10201,2015-4-15,01:00:00,56707067,220)

STORE D INTO '/home/infoobjects/aa.csv' using PigStorage(',');
Kishore
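
For readers who find the positional references hard to follow, here is a hedged variant of the same join that uses field names instead. It is only a sketch: the input paths, the output directory, the tab delimiter, the column types, and the absence of a header row in the data files are all assumptions, not details given in the question.

```pig
-- Hypothetical paths and schema; adjust to the real files.
-- Assumes tab-separated data with no header row.
A = LOAD 'file1.txt' USING PigStorage('\t')
    AS (cl_id:chararray, dt:chararray, tm:chararray, c_id:chararray, c_val:long);
B = LOAD 'file2.txt' USING PigStorage('\t')
    AS (cl_id:chararray, dt:chararray, tm:chararray, c_id:chararray, c_val:long);

-- Pair each row of A with the row of B that has the same cl_id and c_id.
C = JOIN A BY (cl_id, c_id), B BY (cl_id, c_id);

-- After a join, fields are disambiguated with the relation prefix (A:: or B::).
D = FOREACH C GENERATE
        A::cl_id            AS cl_id,
        A::dt               AS dt,
        A::tm               AS tm,
        A::c_id             AS c_id,
        A::c_val + B::c_val AS c_val;

-- Hypothetical output directory.
STORE D INTO 'summed_output' USING PigStorage(',');
```

If the same (cl_id, c_id) pair can appear with more than one date or time in a file, joining on those two keys alone would pair every such row with every other; in that case dt and tm would probably need to be added to the join key.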
  • Hi Krish, thanks for the answer. Can you please tell me what $0, $1, $4, $9 are used for here? – Deepak Patil Sep 04 '15 at 12:39
  • Can we convert the same script into Python Spark? – Deepak Patil Sep 28 '15 at 10:15
  • $0, $1, $4, $9 are field positions (indexes). For example, if you have 3 columns (id, name, college), you can get id with $0, name with $1, and college with $2. – Kishore Sep 28 '15 at 11:03
  • Thanks Kishore. Do you have any idea how we can convert this into Python Spark code? For example, first load the two files from HDFS, then perform joins or other transformations to get the aggregated results. – Deepak Patil Sep 28 '15 at 11:06
  • That is a different question altogether. You may create a new question if you want; make sure you add a python tag. – Kishore Sep 28 '15 at 11:29