My problem: I have a 5-node Hadoop cluster, and the files on the cluster take up 350 GB. I am running a Pig script which joins three different files. On every run, the map tasks all complete in less than 30 minutes, then the reduce tasks run for about 6 hours; in the best case, all of those reduce tasks fail at the end. In the worst case my Hadoop cluster gets stuck entirely, because the namenode goes into safe mode when it runs out of space (quota exceeded).
The problem is caused by the tmp directory, which eats up all the available space (7 TB!!). My script looks like this:
-- load the three inputs
info_file = LOAD '$info' AS (name, size, type, generation, streamId);
chunks_file = LOAD '$chunk' AS (fp, size);
relation_file = LOAD '$relation' AS (fp, filename);

-- first join on fp; in chunks_relation, $3 is relation_file::filename
chunks_relation = JOIN chunks_file BY fp, relation_file BY fp;
-- second join: filename against info_file's name
chunks_files = JOIN chunks_relation BY $3, info_file BY $0;

-- project only the columns I need and store
result = FOREACH chunks_files GENERATE $0, $1, $3, $5, $6, $7, $8;
STORE result INTO '$out';
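One thing I was wondering about: would compressing Pig's intermediate temp files shrink the tmp usage enough? A minimal sketch of what I mean, assuming the pig.tmpfilecompression properties and the old-style mapred.* compression settings exist in my Pig/Hadoop versions:

-- assumed settings, not yet verified on my cluster:
-- compress the temp files Pig writes between MapReduce jobs
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;
-- compress map output within each job
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

And if info_file happens to be small enough to fit in memory, maybe a fragment-replicate join on the second join would skip its reduce phase entirely:

-- only valid if info_file (the last, replicated relation) fits in memory
chunks_files = JOIN chunks_relation BY $3, info_file BY $0 USING 'replicated';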
Any ideas?