I'm having an interesting behaviour with PigStorage
and its -tagPath
option, where I do not know if I am doing something wrong (wrong schema definition?) or if this is a limitation/bug in Pig.
My file looks like this (the most basic, I was able to come up with):
A
B
Now I can load and subselect this file like this fine:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';') AS (char: chararray);
DUMP vals
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(A)
(B)
(A)
(B)
However, when I try to fetch the filepath with -tagPath
(I need it when I access a whole folder of data), the data gets loaded correctly into the first variable, but I cannot subselect a column from it.
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath')
AS (filepath: chararray, char: chararray);
DUMP vals
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
However, when I first read the data without schema and then add a schema using FOREACH
it works fine again:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath');
vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;
DUMP vals_n
one_column = FOREACH vals GENERATE char;
DUMP one_column
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)
So is there any way, I can use -tagPath
and schema in the LOAD
phase at the same time?