I'm seeing odd behaviour with PigStorage and its -tagPath option, and I don't know whether I am doing something wrong (a wrong schema definition?) or whether this is a limitation or bug in Pig.

My file looks like this (the most basic example I could come up with):

A
B

I can load this file and select a column from it without problems:

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';') AS (char: chararray);

DUMP vals;

one_column = FOREACH vals GENERATE char;

DUMP one_column;

Results in:

(A)
(B)
(A)
(B)

However, when I try to fetch the file path with -tagPath (I need it when I read a whole folder of data), the data is loaded correctly into the first relation, but I cannot select a column from it: asking for char returns the file path instead.

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath')
    AS (filepath: chararray, char: chararray);

DUMP vals;

one_column = FOREACH vals GENERATE char;

DUMP one_column;

Results in:

(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)

However, when I first load the data without a schema and then add the schema with FOREACH, it works again:

vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath');

vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;

DUMP vals_n;

one_column = FOREACH vals_n GENERATE char;

DUMP one_column;

Results in:

(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)

So is there any way to use -tagPath and a schema in the LOAD statement at the same time?

aufziehvogel
  • This looks weird; I am seeing the same result too. I checked the Java class for PigStorage. It returns the source path along with the schema. Though tagPath is deprecated and tagFile is replaced with latest versions. tagFile has the same behaviour as well. I am interested in knowing the reason. Did you raise this as a Pig issue? – Govind Aug 13 '15 at 02:03
  • No, I haven't yet. Since I am a Pig newcomer, I didn't know whether I was doing something wrong. I'll check that from the office tomorrow. – aufziehvogel Aug 13 '15 at 18:01
  • Do post the answer here if you get one. Thanks! – Govind Aug 13 '15 at 18:10
  • I haven't tested it yet, but there seems to be a solution here: http://www.webopius.com/content/764/resolved-apache-pig-with-tagsource-tagfile-option-generates-incorrect-columns – aufziehvogel Aug 14 '15 at 08:34
  • I tried it, and yes, it's working. – Govind Aug 14 '15 at 14:57

1 Answer

This happens because Pig tries to work out automatically which columns the script actually uses and loads only those. With -tagFile or -tagPath, this detection seems to get confused.

The solution is to run the Pig script with this column detection (the ColumnMapKeyPrune optimizer rule) turned off:

pig -x mapreduce -t ColumnMapKeyPrune
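
For reference, here is a minimal sketch of how the workaround might be applied end to end. The script name tagpath.pig and the local working directory are assumptions made for illustration. The LOAD statement is the one from the question, and -t is the switch that turns off the named optimizer rule. The pig call is skipped when no Pig client is on the PATH.

```shell
# Write the failing script from the question to a file.
# The file name tagpath.pig is hypothetical; use whatever name you like.
cat > tagpath.pig <<'EOF'
vals = LOAD '/user/guest/test.txt'
    USING PigStorage(';', '-tagPath')
    AS (filepath: chararray, char: chararray);

one_column = FOREACH vals GENERATE char;

DUMP one_column;
EOF

# Run the script with the ColumnMapKeyPrune optimizer rule turned off.
# Guarded so the snippet does nothing harmful on a machine without Pig.
if command -v pig >/dev/null 2>&1; then
    pig -x mapreduce -t ColumnMapKeyPrune tagpath.pig
fi
```

Disabling the rule means Pig no longer prunes unused columns at load time, so this may cost some performance on wide inputs. The schema-after-load variant from the question is an alternative that leaves the optimizer untouched.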
aufziehvogel