I am using Apache Pig to process some data, and at the end of my script I use:

store data into  '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge /mypath/tempp2  /localpath/data.tsv;

That way I get a TSV file that I can read with read_csv(header=0) in pandas.

The problem is that the TSV file now contains the headers in the first row (which is nice), but the schema also ends up concatenated to the first observation in the second row, like this:

col1             col2      col3     
{pigschema}0     1         2      

assuming the first row is [0,1,2]. So unless I use skiprows=1 in read_csv (losing that row), I get this weird observation in my data.

So I wonder if there is a better way to export my data, while getting the headers.

Many thanks!

MaxU - stand with Ukraine
ℕʘʘḆḽḘ

1 Answer

First of all, you want to use the -nl parameter with -getmerge:

store data into  '/mypath/tempp2' using PigStorage('\t','-schema');
fs -getmerge -nl /mypath/tempp2  /localpath/data.tsv;

Docs:

Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

Then /localpath/data.tsv will have the following structure:

0 - header line
1 - empty line
2 - Pig schema
3 - empty line
4 - 1st line of DATA
5 - 2nd line of DATA
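
Before hard-coding line numbers, it may be worth checking that your own merged file really has this layout. A minimal sketch, assuming a small hypothetical sample file (the helper name preview_lines and the sample contents, including the {pigschema} placeholder, are made up for illustration):

```python
import os
import tempfile
from itertools import islice


def preview_lines(path, n=6):
    """Return the first n lines of a text file as (0-based index, text) pairs."""
    with open(path, encoding="utf-8") as fh:
        return [(i, line.rstrip("\n")) for i, line in enumerate(islice(fh, n))]


# Hypothetical sample mimicking the layout described above:
# header, empty line, schema placeholder, empty line, then data rows.
sample = "col1\tcol2\tcol3\n\n{pigschema}\n\n0\t1\t2\n3\t4\t5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.tsv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    preview = preview_lines(path)

# Show each line with its index so the rows to skip are easy to spot.
for i, text in preview:
    print(i, repr(text))
```

On a real export you would call preview_lines('/localpath/data.tsv') instead of writing a sample file; the indices it shows are the ones that matter for skiprows.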
...

so now you can easily read it in pandas:

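import pandas as pd

# skip the empty line, the Pig schema and the second empty line (0-based line numbers)
df = pd.read_csv('/localpath/data.tsv', sep='\t', skiprows=[1,2,3])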
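
To try this without a Hadoop cluster, here is a small self-contained sketch. It assumes the merged file has exactly the layout listed above; the string below is a hypothetical stand-in for /localpath/data.tsv, with {pigschema} as a placeholder for the real schema line.

```python
import io

import pandas as pd

# Hypothetical stand-in for the merged file: header, empty line,
# schema placeholder, empty line, then two data rows.
merged = "col1\tcol2\tcol3\n\n{pigschema}\n\n0\t1\t2\n3\t4\t5\n"

# Skip file lines 1-3 (0-based) so only the header and the data rows remain.
df = pd.read_csv(io.StringIO(merged), sep="\t", skiprows=[1, 2, 3])

print(df.columns.tolist())
print(df)
```

With this sample, the first data row should survive intact instead of being lost or fused with the schema. If your merged file turns out to have a different layout, the skiprows list would need adjusting.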
MaxU - stand with Ukraine