0

I have taken off some of the columns from a CSV using pig script:

Cleaned = FOREACH data generate  $0 .. $8,$11 .. $27, $31 .. $41, $45 .. $97, $99 .. $111;

In the columns that I have kept, I need to take off any new line character that may corrupt my data in hive. Be it \n or \r or \r\n or <br>. Since it is user entered data, I believe the line breaks created while typing the data using enter key would be one of the characters mentioned above, would appreciate if you could also specify what it is converted to but for the most part I need to make sure that any sort of line break is taken off the data to make sure that my data is mapped properly by hive CSV Parser. How do I do it in the pig script I am using to filter out the columns ?

Edit: 1. I wish to continue using column range instead of having to specify each column. 2. The example pointed to does not take care of all type of new line characters.

gags
  • 255
  • 2
  • 17
  • Possible duplicate of [Pig - Remove line feed, return and tab](http://stackoverflow.com/questions/36212024/pig-remove-line-feed-return-and-tab) – Hack-R Oct 09 '16 at 14:38
  • `B = FOREACH A GENERATE REPLACE(REPLACE(id,'\n',''),',','') AS id, REPLACE(REPLACE(date,'\n',''),',','') AS date, REPLACE(REPLACE(text,'\n',''),',','') AS text;` this is just an example you can tweak http://stackoverflow.com/questions/31394130/pig-remove-embedded-newlines-and-commas-in-gzip-files – Hack-R Oct 09 '16 at 14:39
  • @Hack-R: I cam across this example. I have specified column range here. How do I integrate column range with that? – gags Oct 09 '16 at 14:51
  • 1
    Sidenote: You are setting yourself up for a world of hurt if you refer to columns by their position and your input changes and a column is added/removed somewhere in the middle. My (personal) advice is to refer to columns by name rather than number. (So for instance `FirstCol .. SomeCol` rather than `$0 .. $8`. Note that this still allows you to work with ranges. – Dennis Jaheruddin Oct 26 '16 at 08:09

2 Answers2

2

Since you have a large number of columns, you are better off loading the data as a line:chararray and then using REPLACE function.Once the data is cleansed,you can use the STRSPLIT to split the line into columns and then use the range to get the required columns.

A = LOAD 'data.csv' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(REPLACE(REPLACE(line,'\\n',''),'\\r',''),'\\r\\n',''),'<br>','');
C = FOREACH B GENERATE FLATTEN(STRSPLIT($0,','));
D = FOREACH C GENERATE  $0 .. $8,$11 .. $27, $31 .. $41, $45 .. $97, $99 .. $111;
DUMP D;
nobody
  • 10,892
  • 8
  • 45
  • 63
  • Can we use replace(line, '(\n|\t)', '') instead using multiple replace() functions? – Shekhar Jan 23 '17 at 12:30
  • 1
    @Shekhar No. REPLACE syntax is REPLACE(string, 'oldChar', 'newChar'); You cannot use regex or replace multiple characters using | in a single replace unless newer version of Pig has that capability.Best way is to test it out yourself. – nobody Jan 23 '17 at 15:18
0

You can use "org.apache.pig.piggybank.storage.CSVExcelStorage(',')" storage class to import data and then you can use replace command to replace "new line" characters with in a field's data.

REGISTER /usr/lib/pig/piggybank.jar; a = LOAD '/path/to/file.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',') b = FOREACH a GENERATE $1; DUMP b

Piggybank is a default jar provided, so it should be available with out registering.

prasad
  • 217
  • 5
  • 10