Skipping the header while loading a text file using Pig Latin

Question

I have a text file whose first row contains the header. I want to run some operations on the data, but when I load the file using PigStorage, the header row is loaded too. I just want to skip the header. Is it possible to do so (directly or through a UDF)?

This is the command I'm using to load the data:

input_file = load '/home/hadoop/smdb_tracedata.csv'
USING PigStorage(',')
as (trans:chararray, carrier:chararray,aainday:chararray);

Please post the code you have tried. And before you do, take a brief look at http://sscce.org. — Erik Kaplun, Oct 01 '13 at 11:45
Dude, that goes in the question, not the comments. Also, this does not look like Python to me at all. Why did you tag the question with "python"? — Erik Kaplun, Oct 01 '13 at 11:55
Please pay attention to what I said in my 2 comments; otherwise the negative votes will just keep coming; in addition, I would draw your attention to the fact that Stack Overflow has perfectly nice formatting features, so please use them—it's hard to read what you posted otherwise. — Erik Kaplun, Oct 01 '13 at 12:04
@ErikAllik Just so you know, he likely tagged the question with python because pig functions can be written in python. Also, for questions like this in pig, it is very difficult to produce a sscce because of the documentation. — mr2ert, Oct 01 '13 at 12:39

score 10 · Answer 1 · answered Oct 01 '13 at 15:20

Usually the way I solve this problem is to use a FILTER on something I know is in the header. For example, consider the following data example:

STATE,NAME
MD,Bob
VA,Larry

I'll do:

B = FILTER A BY state != 'STATE';
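
To see what this filter keeps and drops, here is a small Python simulation (not Pig code). The sample rows and the `filter_header` helper are made up for illustration, and it assumes every file repeats the same header and that no real data row has `STATE` in its first column.

```python
# Python simulation of: B = FILTER A BY state != 'STATE';
# The two "files" below are invented sample data.
file_1 = ["STATE,NAME", "MD,Bob", "VA,Larry"]
file_2 = ["STATE,NAME", "NY,Alice"]


def filter_header(lines, header_value="STATE"):
    """Keep rows whose first field differs from the known header value."""
    rows = [line.split(",") for line in lines]
    return [row for row in rows if row[0] != header_value]


# Loading several files at once behaves like concatenating them,
# so the header appears once per file and the filter drops each copy.
kept = filter_header(file_1 + file_2)
for state, name in kept:
    print(state, name)
```

The same test would also drop a genuine data row whose first field happened to equal the header text, so pick a column where that cannot occur.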

Donald Miner

This seems to be the only answer that works for multiline headers in multiple files. – Dennis Jaheruddin Aug 22 '16 at 08:19
This only works if the header has the same columns as the data – Balint Bako Dec 23 '16 at 11:39

score 9 · Accepted Answer · Davis Broda · 2013-10-01T17:53:38.033

If you have Pig version 0.11 or later, you could try this:

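-- Load the file as usual; the header comes in as the first row.
input_file = LOAD '/home/hadoop/smdb_tracedata.csv' USING PigStorage(',') AS (trans:chararray, carrier:chararray, aainday:chararray);

-- RANK prepends a row-number column named rank_input_file.
ranked = RANK input_file;

-- Drop the row ranked 1, i.e. the header.
NoHeader = FILTER ranked BY (rank_input_file > 1);

-- Restore the original row order.
Ordered = ORDER NoHeader BY rank_input_file;

-- Project away the rank column.
New_input_file = FOREACH Ordered GENERATE trans, carrier, aainday;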

This removes the first row, leaving New_input_file identical to the original minus the header row (assuming the header is the first row in the file). Note that the RANK operator was introduced in Pig 0.11, so on an earlier version you will need to find another way.

Edit: added the ordered line in order to make sure New_input_file maintains the same order as the original input file
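
The steps can be traced in a few lines of Python. This is only a simulation of the logic, not Pig code: the sample rows and the `drop_first_ranked` helper are invented, and it assumes RANK numbers the rows in file order starting at 1.

```python
# Python simulation of the RANK / FILTER / ORDER / FOREACH steps.
# The rows are invented sample data; the first one plays the header.
rows = [
    ("trans", "carrier", "aainday"),  # header line
    ("t1", "c1", "d1"),
    ("t2", "c2", "d2"),
]


def drop_first_ranked(rows):
    # ranked = RANK input_file;  -> prepend a 1-based row number
    ranked = [(rank, *row) for rank, row in enumerate(rows, start=1)]
    # NoHeader = FILTER ranked BY (rank_input_file > 1);
    no_header = [r for r in ranked if r[0] > 1]
    # Ordered = ORDER NoHeader BY rank_input_file;
    ordered = sorted(no_header, key=lambda r: r[0])
    # New_input_file = FOREACH Ordered GENERATE trans, carrier, aainday;
    return [r[1:] for r in ordered]


new_input = drop_first_ranked(rows)
print(new_input)

# Two files loaded together: only the row ranked 1 is removed,
# so the second file's header survives.
two_files = drop_first_ranked(rows + rows)
print(two_files)
```

The second call shows why this approach does not carry over to loading several CSV files at once: only one row overall gets rank 1.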

edited Oct 01 '13 at 17:53

answered Oct 01 '13 at 12:42

Davis Broda

Note that this won't work if you need to load multiple csv files. Also, are you sure that the lines in `input_file` will still be in the same order as in the file? – mr2ert Oct 01 '13 at 12:48
No, it will not work on multiple files (hadn't thought of that when I responded). The given code will not preserve order. However if you need to preserve order (and don't have the multiple files problem) just add the line `ordered = order NoHeader by rank_input_file` to get it in order. If you use NoHeader instead of New_input_file for later operations you can use the rank to get the data back into the original order at any point in the code that you require it by using order by rank. – Davis Broda Oct 01 '13 at 13:53
There is a much easier way if you have Pig 0.12 or newer. See my answer using CSVExcelStorage. – Mike Pone Oct 30 '15 at 21:52

score 7 · Answer 3 · edited Mar 08 '14 at 22:30

Here is another way of doing this:

Load the complete file, including the header record, into a relation:

fileAllRecords = LOAD 'csvfilename' using PigStorage(',');

Use the Linux tail command to stream only the data records:

fileDataRecords = STREAM fileAllRecords THROUGH `tail -n +2` AS (f1:chararray ..);

To verify that the header record is removed, use the following commands:

firstFewRecords = STREAM fileDataRecords THROUGH `head -20`;
DUMP firstFewRecords;
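
A caveat worth checking before relying on this: as far as I know, Pig starts the streaming command once per task rather than once per file, so on an input large enough to be split, `tail -n +2` may drop the first line of every chunk and not just the header. The Python sketch below (not Pig code) only illustrates that effect. The sample data, the two-chunk split, and the `tail_from_line_2` helper are assumptions for illustration.

```python
from itertools import islice


def tail_from_line_2(lines):
    """Equivalent of `tail -n +2`: emit everything from the second line on."""
    return list(islice(lines, 1, None))


# Invented sample data; the first line is the header.
data = ["STATE,NAME", "MD,Bob", "VA,Larry", "NY,Alice", "CA,Carol"]

# One process sees the whole input: only the header is dropped.
single = tail_from_line_2(data)

# Hypothetical case: the input is handled as two chunks, and each chunk
# is piped through its own copy of the command.
chunks = [data[:3], data[3:]]
per_chunk = [line for chunk in chunks for line in tail_from_line_2(chunk)]

print(single)
print(per_chunk)
```

In the chunked case the `NY,Alice` row disappears along with the header, so it is worth comparing row counts on a realistic input before using this in a job.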

Mike Pone · Answer 4 · 2015-10-15T19:20:26.307

You want to use CSVExcelStorage, found in piggybank. It lets you set parameters for how to handle headers, line endings, quoted fields, and other CSV options. The constructor you want is only available in Pig 0.12 and later, and has the signature:

CSVExcelStorage(String delimiter, String multilineTreatmentStr, String eolTreatmentStr, String headerTreatmentStr)

Pig code below:

REGISTER /usr/lib/pig/piggybank.jar;

input_file = load '/home/hadoop/smdb_tracedata.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'default', 'NOCHANGE', 'SKIP_INPUT_HEADER')
as (trans:chararray, carrier:chararray,aainday:chararray);

score -1 · Answer 5 · answered Mar 26 '16 at 05:55

Errors of this kind generally occur when you try to convert incompatible data types. I faced a similar issue, and the cause was that the file I was loading contained a header, which triggered the error. Other probable causes are NA values or spaces in a column.
