I have 300K small .bz2 files on HDFS, and I am writing a Pig job to merge them into 500 output .bz2 files. The total size of the small files is about 300 GB, but after merging, the total size of the 500 merged files is around 500 GB. This is the Pig script I used:
data = LOAD 'inputFolder'; -- 300K files, ~300 GB total
data = DISTINCT data PARALLEL 500;
STORE data INTO 'outputFolder'; -- 500 files, ~500 GB total
Can you explain why the total size increases after the merge? Is there an alternate method in Pig to do the same thing?
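In case it is relevant, this is a variant I was thinking of trying next. The SET properties and the codec class name below are my assumptions about how to force bzip2 compression on the final output; I have not verified whether missing compression settings actually explain the size difference.

-- Sketch of the same merge with output compression forced explicitly.
-- The mapred.* property names and the BZip2Codec class are assumptions on my part.
SET mapred.output.compress true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';

data = LOAD 'inputFolder' USING PigStorage();   -- 300K small .bz2 files
merged = DISTINCT data PARALLEL 500;            -- 500 reducers -> 500 output parts
STORE merged INTO 'outputFolder' USING PigStorage();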
Thanks, Tony