
I'm having a lot of trouble getting data out of pig and into a CSV that I can use in Excel or SQL (or R or SPSS etc etc) without a lot of manipulation ...

I've tried using the following function:

STORE pig_object INTO '/Users/Name/Folder/pig_object.csv'
    USING CSVExcelStorage(',','NO_MULTILINE','WINDOWS');

It creates the folder with that name with lots of part-m-0000# files. I can later join them all up using cat part* > filename.csv but there's no header which means I have to put it in manually.
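(For reference, the manual join plus header step can at least be collapsed into one shell command; the directory, part files, and column names below are made up for illustration:)

```shell
# Fabricated stand-ins for Pig's part-m-0000# output files
mkdir -p /tmp/pig_demo
printf 'alice,30\n' > /tmp/pig_demo/part-m-00000
printf 'bob,25\n'   > /tmp/pig_demo/part-m-00001

# Prepend a hand-written header while concatenating the parts
{ echo 'name,age'; cat /tmp/pig_demo/part-m-*; } > /tmp/pig_demo/filename.csv
```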

I've read that PigStorageSchema is supposed to create a header file as well, but it doesn't seem to work at all; I get the same result as a plain store, with no header file:

STORE pig_object INTO '/Users/Name/Folder/pig_object'
    USING org.apache.pig.piggybank.storage.PigStorageSchema();

(I've tried this in both local and mapreduce mode).

Is there any way of getting the data out of Pig into a simple CSV file without these multiple steps?

Any help would be much appreciated!

Ted Hopp
Saxivore

2 Answers


I'm afraid there isn't a one-liner that does the job, but you can do the following (Pig v0.10.0):

A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',') 
      as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');

When PigStorage is given '-schema', it creates a '.pig_schema' and a '.pig_header' file in the output directory. You then have to merge '.pig_header' with the 'part-x-xxxxx' files:

1. If the result needs to be copied to the local disk:

hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv

(Since -getmerge takes an input directory you need to get rid of .pig_schema first)
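Since -getmerge simply concatenates every file in the directory, the hidden .pig_schema would otherwise end up inside your CSV. The effect can be sketched locally with plain cat (file names and contents below are fabricated for illustration):

```shell
# Fabricated local stand-in for the HDFS output directory
mkdir -p /tmp/csvoutput
printf 'firstname\tlastname\tage\tlocation\n' > /tmp/csvoutput/.pig_header
printf '{"fields":[]}\n'                      > /tmp/csvoutput/.pig_schema
printf 'John\tDoe\t34\tNY\n'                  > /tmp/csvoutput/part-m-00000

# Local equivalent of: hadoop fs -rm .../.pig_schema, then hadoop fs -getmerge
rm /tmp/csvoutput/.pig_schema
cat /tmp/csvoutput/.pig_header /tmp/csvoutput/part-m-* > /tmp/output.csv
```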

2. Storing the result on HDFS:

hadoop fs -cat /user/hadoop/csvoutput/.pig_header 
  /user/hadoop/csvoutput/part-x-xxxxx | 
    hadoop fs -put - /user/hadoop/csvoutput/result/output.csv

For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?

Lorand Bendig
  • Thanks very much for your assistance, Lorand. Calling the shell commands cleans things up a bit. However, the -schema argument doesn't create the .pig_header or .pig_schema files when I run it. I've tried using PigStorage and PigStorageSchema. I am running Pig 0.10.0. – Saxivore Dec 05 '12 at 00:47
  • Hi mate, further to this ... the .pig_header file _is_ created but it was hidden and I couldn't see it (I'm still getting used to working with shell stuff). So, I do have a workaround that can be done within script which is awesome. Thanks! – Saxivore Dec 05 '12 at 05:11
  • @Saxivore Check the schema of `pig_object`. If it doesn't have one, no header or schema files will be created. Do you need any more help with this question? – Lorand Bendig Dec 05 '12 at 11:56
  • @Saxivore Why wouldn't you mark this as the "best answer"? – Sasha O Jul 06 '13 at 17:24
  • @LorandBendig PigStorage() is creating binary part files; is there a way to get text files, so that when we use getmerge the result is in readable CSV format? – Dutta Jul 22 '15 at 15:48

If you store your data using PigStorage on HDFS and then merge it with -getmerge -nl:

STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
    using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object  /Users/Name/Folder/pig_object.csv;

Docs:

Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

You will have a single TSV/CSV file with the following structure:

1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA
...

So we can simply remove lines 2, 3 and 4 using AWK:

awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv
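To sanity-check that AWK invocation, you can fabricate a merged file with the six-line structure shown above (schema line abridged, data made up) and run the same command:

```shell
# Fabricate a merged file matching the structure described above
{
  printf 'firstname\tlastname\tage\tlocation\n'  # 1 - header
  printf '\n'                                    # 2 - empty line
  printf '{"fields":[]}\n'                       # 3 - pig schema (abridged)
  printf '\n'                                    # 4 - empty line
  printf 'John\tDoe\t34\tNY\n'                   # 5 - 1st line of DATA
  printf 'Jane\tRoe\t29\tLA\n'                   # 6 - 2nd line of DATA
} > /tmp/pig_object.csv

# Keep the header (line 1) and everything after line 4
awk 'NR==1 || NR>4 {print}' /tmp/pig_object.csv > /tmp/pig_object_clean.csv
```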
MaxU - stand with Ukraine