
I have some memory issues in Pig.

This is my code:

a = load 'some file'; 
b = load 'file2';
cond = load 'cond file';

c = union a,b;
cc = join c by $0, cond by $0;
dd = foreach cc generate $0,$1;
reduce = foreach (group dd by RANDOM()) generate flatten(dd);

cc = join c by $1, cond by $0;
dd = foreach cc generate $1,$2;
reduce2 = foreach (group dd by RANDOM()) generate flatten(dd);

final = union reduce, reduce2; 

store final into 'final_output'; 

Will there be any issues with this code? I ran it on a small sample and it seems fine, but I am not sure whether it has implications that I am unaware of.

Please ignore the code quality. I know this is not a good way to write scripts, or code in general, but it is just a one-off script.

aceminer

1 Answer


Short Answer: No issues.

Long Answer: Pig Latin variables (aliases) behave like variables in any other programming language. In a Java program, you might declare a variable for purpose A and later reuse it for purpose B, purpose C, and so on. There is nothing wrong with this approach as long as it produces the result you want. Performance-critical code often reuses variables this way, sometimes combined with bit manipulation; you see this kind of code most commonly in embedded systems.

In your use case, however, Pig Latin is used for batch processing of huge datasets or event streams, so the amount of data being processed is not comparable to what an embedded system handles. Reusing variables shouldn't give any extra benefit in terms of performance. The downside is that your ETL pipelines become harder to read and understand, and possibly more prone to bugs. As a result, it's not a recommended practice.
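For comparison, here is a minimal sketch of the same script with one alias per step. The names joined_col0, picked_col0, shuffled_col0 and their col1 counterparts are hypothetical, chosen only for illustration. The file paths are the placeholders from the question. It also assumes the grouping step was meant to be GROUP ... BY RANDOM().

```pig
-- Paths are the placeholders from the question; no schemas are declared,
-- so fields are addressed by position ($0, $1, ...).
a    = LOAD 'some file';
b    = LOAD 'file2';
cond = LOAD 'cond file';

c = UNION a, b;

-- First pass: match the first column of c against cond.
-- (alias names below are hypothetical)
joined_col0   = JOIN c BY $0, cond BY $0;
picked_col0   = FOREACH joined_col0 GENERATE $0, $1;
shuffled_col0 = FOREACH (GROUP picked_col0 BY RANDOM()) GENERATE FLATTEN(picked_col0);

-- Second pass: match the second column of c against cond.
joined_col1   = JOIN c BY $1, cond BY $0;
picked_col1   = FOREACH joined_col1 GENERATE $1, $2;
shuffled_col1 = FOREACH (GROUP picked_col1 BY RANDOM()) GENERATE FLATTEN(picked_col1);

final = UNION shuffled_col0, shuffled_col1;
STORE final INTO 'final_output';
```

This should produce the same output as the version that reuses cc and dd. The only difference is that each intermediate relation can be inspected on its own, for example with DESCRIBE or DUMP, without having to work out which assignment an alias currently points to.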

Raghu Kumar