The problem can be reproduced with a simple test.
The Pig script is as follows:

-- Keep one mapper per input file instead of combining small input splits.
SET pig.noSplitCombination true;
dataIn = LOAD 'input/Test';
-- Ship TestScript to the task nodes; input() delivers each mapper's records to
-- the script as a local file named 'DummyInput.txt', and output() names the
-- files the script is expected to write.
DEFINE macro `TestScript` input('DummyInput.txt') output('A.csv', 'B.csv', 'C.csv', 'D.csv', 'E.csv') ship('TestScript');
dataOut = STREAM dataIn THROUGH macro;
STORE dataOut INTO 'output/Test';
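
For reference, this is how the test is launched; the script file name 'Test.pig' is an assumption, since the post does not name the file:

# Assumption: the Pig Latin above is saved as Test.pig, with the TestScript
# file in the same local directory so that ship('TestScript') can find it.
pig Test.pig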

The actual script is a complex R program, but here is a simple "TestScript" that reproduces the problem and doesn't require R:

#!/bin/bash
# Ignore the input coming from the 'DummyInput.txt' file;
# for now just create some output data files.

echo "File A" > A.csv
echo "File B" > B.csv
echo "File C" > C.csv
echo "File D" > D.csv
echo "File E" > E.csv

The input file 'DummyInput.txt' just contains some dummy data for now:

Record1
Record2
Record3

For the test, I've loaded the dummy data into HDFS using the following script, which results in 200 input files.

for i in {0..199}
do
    hadoop fs -put DummyInput.txt input/Test/Input$i.txt
done
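
A quick way to confirm the upload produced the expected 200 files before running the job (both are standard HDFS shell commands):

# Summary: directory count, file count, total size, path.
hadoop fs -count input/Test

# Or count the uploaded files explicitly.
hadoop fs -ls input/Test | grep -c 'Input.*\.txt'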

When I run the Pig job, it completes without errors and 200 mappers run as expected. I therefore expect to see 200 files in each of the output directories in HDFS. Instead, a number of the output files are missing:

       1          200               1400 output/Test/B.csv
       1          200               1400 output/Test/C.csv
       1          189               1295 output/Test/D.csv
       1          159               1078 output/Test/E.csv

The root "output/Test" has 200 files, which is correct. Folders "B.csv" and "C.csv" have 200 files as well. However, folders "D.csv" and "E.csv" have missing files.

We have looked at the logs but can't find anything which points to why the local output files are not being copied from the data nodes to HDFS.
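
Beyond the Pig client output, the aggregated task logs are worth searching; a sketch, assuming a YARN cluster (on an MR1 cluster the equivalent information is in the TaskTracker logs reachable from the JobTracker web UI):

# <application_id> is the ID printed by the Pig client when the job is submitted.
yarn logs -applicationId <application_id> | grep -iE 'exception|error|\.csv'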
