
I have a gzip file with data fields separated by commas. I am currently using PigStorage to load the file as shown below:

A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text);

The data in the gzip file contains embedded newlines and commas. These characters occur in all three fields (id, date and text) and are always enclosed in double quotes.

I would like to replace or remove these characters using Pig before doing any further processing.

I think I first need to find the double quotes, then search the quoted string for commas and newline characters, and finally replace them with a space or remove them.

How can I achieve this via Pig?

Shawn
activelearner
  • Fields are separated by commas, and the fields themselves can contain commas? What delimits the commas within the fields? Or are these fields already parsed? – sln Jul 13 '15 at 21:56
  • @sln Yes, fields are separated by commas. Within the fields there can also be commas, which are not field separators but part of the text contained in the fields. The commas within the fields are enclosed in double quotes. – activelearner Jul 13 '15 at 22:14
  • @activelearner: CSVExcelStorage or CSVLoader will load text enclosed in double quotes as a single field. Try the suggested answer and let me know. – Murali Rao Jul 14 '15 at 05:56

1 Answer

Try this:

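-- Piggybank provides the CSV-aware loaders
REGISTER piggybank.jar;
-- Load quoted text as a single field instead of splitting on every comma
A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (id:chararray,date:chararray,text:chararray);
-- Strip embedded newlines and commas from each field
B = FOREACH A GENERATE
        REPLACE(REPLACE(id,'\n',''),',','') AS id,
        REPLACE(REPLACE(date,'\n',''),',','') AS date,
        REPLACE(REPLACE(text,'\n',''),',','') AS text;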

We can use either org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().

Refer to the API links below for details:

  1. http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
  2. http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
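
If the quoted fields can span several lines, a variant of the same script with multiline handling switched on may be needed. The following is only a sketch: it assumes the two-argument CSVExcelStorage constructor (delimiter, multiline option) listed in the Piggybank API docs, that piggybank.jar can be resolved from the working directory, and that replacing the embedded characters with a space (rather than removing them) is acceptable.

```pig
-- Assumption: piggybank.jar is resolvable from here; adjust the path as needed.
REGISTER piggybank.jar;

-- 'YES_MULTILINE' asks the loader to treat line breaks inside quoted fields
-- as part of the field instead of as the end of the record.
A = LOAD 'myfile.gz'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE')
    AS (id:chararray, date:chararray, text:chararray);

-- Replace each embedded newline and comma with a single space.
B = FOREACH A GENERATE
        REPLACE(REPLACE(id,   '\n', ' '), ',', ' ') AS id,
        REPLACE(REPLACE(date, '\n', ' '), ',', ' ') AS date,
        REPLACE(REPLACE(text, '\n', ' '), ',', ' ') AS text;
```

Using a space as the replacement keeps words on either side of a removed newline from running together; pass an empty string as the last argument of each REPLACE call to drop the characters instead.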
Murali Rao
  • Thanks a lot! This worked for me. The only modification I made was in loading the gzip file: A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id:chararray,date:chararray,text:chararray); YES_MULTILINE allows line breaks within the fields, which prevents fields containing line breaks from being truncated in the load step. Once the data is loaded with this change, the REPLACE functions remove the embedded characters. – activelearner Jul 16 '15 at 00:19