
My problem is that I have a 5-node Hadoop cluster, and the files on the cluster take 350 GB. I am running a Pig script which joins three different files. Every time, the job takes less than 30 minutes to complete all the map tasks and then 6 hours on the reduce tasks, and in the best case all of these reduce tasks fail at the end. In the worst case my Hadoop gets stuck, because the NameNode goes into safe mode since it does not have enough space (quota exceeded).

The problem is caused by the tmp directory, which takes all the available space (7 TB!!). My script looks like this:

info_file = LOAD '$info' AS (name, size, type, generation, streamId);
chunks_file = LOAD '$chunk' AS (fp, size);
relation_file = LOAD '$relation' AS (fp, filename);

chunks_relation = JOIN chunks_file BY fp, relation_file BY fp;
chunks_files = JOIN chunks_relation BY $3, info_file BY $0;

result = FOREACH chunks_files GENERATE $0, $1, $3, $5, $6, $7, $8;
STORE result INTO '$out';

Any ideas?


1 Answer


Your script looks fine. What is the size of the files that you are joining?

Join is a costly operator anywhere. You can optimize joins by using replicated, skewed, or merge joins in Pig. Go through the documentation for these joins once and apply the one that fits your file sizes and requirements.

https://bluewatersql.wordpress.com/category/Pig/
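For example, here is a minimal sketch reusing the relation names from your script. It assumes that info_file is small enough to fit in memory and that the fp join key is heavily skewed; both are assumptions about your data that you would need to check before applying either hint:

-- Skewed join (two-way joins only): spreads a hot join key across several
-- reducers instead of overloading a single one.
chunks_relation = JOIN chunks_file BY fp, relation_file BY fp USING 'skewed';

-- Fragment-replicate join: the last relation listed (info_file here) must be
-- small enough to fit in memory; it is shipped to every map task and the join
-- happens map-side, with no reduce phase at all.
chunks_files = JOIN chunks_relation BY $3, info_file BY $0 USING 'replicated';

Whether 'replicated' is safe depends on the in-memory relation staying small, and a 'merge' join additionally requires both inputs to be pre-sorted on the join key, so pick the strategy based on the actual sizes and layout of your three files.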

gbharat
  • Do you mean it is normal that 3 files which together are 20 GB would occupy more than 7 TB with every join? – Bafla13 Jan 30 '15 at 19:54