Amazon EMR-4.5, Hadoop 2.7.2, Pig 0.14
I want to load a file using PigStorage's -tagFile option and then project the file name field plus selected fields into a new relation. The projected results are wrong for some field combinations. Examples:
tagfile-test.txt (tab-delimited)
AAA 123 2016
BBB 456 2016
CCC 789 2016
Load-Dump
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
DUMP test;
(tagfile-test.txt,AAA,123,2016)
(tagfile-test.txt,BBB,456,2016)
(tagfile-test.txt,CCC,789,2016)
Correct - GENERATE f0, f1, f2
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f2;
DUMP project;
(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)
Incorrect - GENERATE f0, f1, f3 (third column is f2 instead of f3; result same as above)
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;
(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)
Incorrect - GENERATE f0, f2, f3 (second column is f1 instead of f2, confirming the problem)
test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f2, f3;
DUMP project;
(tagfile-test.txt,AAA,2016)
(tagfile-test.txt,BBB,2016)
(tagfile-test.txt,CCC,2016)
It seems Pig is resolving the projected field names to the wrong columns. Referencing the fields by position ($0, $1, $2, $3) gives the same results.
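
One diagnostic that may narrow this down is to disable the optimizer's column pruning rule and rerun a failing projection. This is a sketch under an assumption that the output above does not confirm: that pruning of unused columns miscounts the extra field -tagFile prepends. The pig.optimizer.rules.disabled property and the ColumnMapKeyPrune rule name are standard Pig; that this rule is the cause here is only a guess.

```pig
-- Assumption: column pruning is what shifts the projected fields.
-- Disable that optimizer rule for this script only.
SET pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;
```

If the third column now comes back as 2016, the pruning rule is implicated. The same rule can also be turned off from the command line with pig -t ColumnMapKeyPrune followed by the script name.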