
Amazon EMR-4.5, Hadoop 2.7.2, Pig 0.14

I would like to project the file name field and selected fields to a new relation after loading with the -tagFile option of PigStorage, but the projected fields do not match the ones I ask for. Examples:

tagfile-test.txt (tab-delimited)

AAA    123    2016
BBB    456    2016
CCC    789    2016

Load-Dump

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
DUMP test;

(tagfile-test.txt,AAA,123,2016)
(tagfile-test.txt,BBB,456,2016)
(tagfile-test.txt,CCC,789,2016)

Correct - GENERATE f0, f1, f2

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f2;
DUMP project;

(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)

Incorrect - GENERATE f0, f1, f3 (result same as above)

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;

(tagfile-test.txt,AAA,123)
(tagfile-test.txt,BBB,456)
(tagfile-test.txt,CCC,789)

Incorrect - GENERATE f0, f2, f3 (returns f0, f1, f3 instead, which confirms the problem)

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);
project = FOREACH test GENERATE f0, f2, f3;
DUMP project;

(tagfile-test.txt,AAA,2016)
(tagfile-test.txt,BBB,2016)
(tagfile-test.txt,CCC,2016)

It seems Pig is not correctly identifying the field names. I have tried using field positions ($0, $1, $2, $3) with the same results.

chillvibes

2 Answers


I faced the same issue when using the -tagFile option with PigStorage, and solved it by adding the following line to the Pig script:

set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

ColumnMapKeyPrune is well explained at http://chimera.labs.oreilly.com/books/1234000001811/ch07.html#debugging_tips
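
For context, here is a minimal sketch of where that setting might go in the script from the question. The relation and field names are copied from the question, and placing the `set` statement before the LOAD is my assumption about a sensible position, not something I have verified on EMR-4.5.

```pig
-- Sketch only: disable the column-pruning optimizer rule before loading.
-- Relation and field names (test, project, f0..f3) are copied from the question.
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

-- With -tagFile, PigStorage prepends the source file name as the first field (f0).
test = LOAD 'tagfile-test.txt' USING PigStorage('\t', '-tagFile') AS (f0, f1, f2, f3);

-- Project the file name plus the first and third data columns.
project = FOREACH test GENERATE f0, f1, f3;
DUMP project;
```

As far as I know, the same rule can also be switched off for a single run from the command line with `pig -optimizer_off ColumnMapKeyPrune`, followed by the script name. Keep in mind that with the rule disabled Pig no longer prunes unused columns, so wide inputs may load more data than the script needs.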

OneUser

It looks like the fields are separated by ',' but you are using '\t' as the delimiter in PigStorage. Also specify the datatypes for the fields.

Try changing this

test = LOAD 'tagfile-test.txt' USING PigStorage('\t','-tagFile') AS (f0, f1, f2, f3);

To

test = LOAD 'tagfile-test.txt' USING PigStorage(',','-tagFile') AS (f0:chararray, f1:chararray, f2:int, f3:int);
nobody
  • Apologies, the file is in fact tab-separated. I re-formatted the input file in the question to make this clear. I have tried adding datatypes, same issue. With datatypes these return blank (in the case of f2 in the last example) because the datatypes are not correct, either. – chillvibes Apr 12 '16 at 20:38