I am new to Hadoop programming and am looking for help with Pig. My data comes from a simple .txt file with , as the delimiter. I have two use cases: I want to apply ltrim(rtrim()) to all the columns, and I want to convert selected fields to upper case.

Here is my script:

party = LOAD '/party_test_pig.txt' USING PigStorage(',') AS (....);
Upper_party = FOREACH party GENERATE UPPER(col1), UPPER(col2), UPPER(col3);
Trim_party = FOREACH Upper_party GENERATE TRIM(*);

Upper_party: After converting to upper case, I want to view all the columns, not only the columns that were changed to upper case.

Trim_party: I did some research and found that to trim all columns I would have to write a UDF. I could write Trim_party = FOREACH Upper_party GENERATE TRIM(col1)...TRIM(coln); but that seems inefficient and time-consuming.

Is there any other way to make this script work without writing a UDF for TRIM?

Thanks in advance.

LazyBones

1 Answer

It would be easier if you gave a sample of your data. From what I understand, I would do it this way:

-- Load each line as one string with TextLoader
A = LOAD '/user/guest/Pig/20151112.PigTest.txt' USING TextLoader() AS (line:CHARARRAY);
-- Convert the whole line to upper case; spaces inside your strings are preserved
B = FOREACH A GENERATE UPPER(line) AS lineUP;
-- Split lines with your delimiter
C = FOREACH B GENERATE FLATTEN(STRSPLIT(lineUP, ',')) AS (col1:CHARARRAY, ... ,coln:CHARARRAY);
-- Trim each column individually (TRIM on the whole line would not affect the fields)
D = FOREACH C GENERATE TRIM(col1) AS col1T, ..., TRIM(coln) AS colnT;
AntonyBrd
  • hi @AntonyBrd thank you for the answer. Upper worked properly. But Trim didn't work. – LazyBones Nov 12 '15 at 17:38
  • I even ran `B = FOREACH A GENERATE TRIM(line) AS lineTRIM;` just to verify if it works but it failed here too. – LazyBones Nov 12 '15 at 17:52
  • RECORD 1: `101,2015-11-11,201, hola ,Shah,Rukh,Khan, Shahrukh Khan ,SRK,Mr,Male,Married,Hindi,2065,1965-11-02,2065-11-02,1992-11-02,2065-11-02,100` RECORD 2: `102,2015-11-12,202, hi ,Kajol,Tanuja,Mukerjee, Kajol Devgan ,KD,Mrs,Female,Married,Hindi,2066,196-11-03,2065-11-03,1992-11-03,2065-11-03,101` – LazyBones Nov 12 '15 at 17:58
  • Did it fail with an error, or did it fail to do what you expected? If the latter, what did you expect? – AntonyBrd Nov 13 '15 at 07:50
  • It failed to do what I expected. My input was ##Shahrukh Khan##, where ## stands for a space that I would like to TRIM in my final output. The expected output is Shahrukh Khan (the left and right spaces should be removed). – LazyBones Nov 13 '15 at 13:54
  • Input: `101,2015-11-11,201, hola ,Shah,Rukh,Khan, Shahrukh Khan ,SRK,Mr,Male,Married,Hindi,2065,1965-11-02,2065-11-02,1992-11-02,2065-11-02,100`. As you can see, in the 8th column there is an extra space after the comma, and I want to get rid of that space. Expected: `101,2015-11-11,201, hola ,Shah,Rukh,Khan,Shahrukh Khan,SRK,Mr,Male,Married,Hindi,2065,1965-11-02,2065-11-02,1992-11-02,2065-11-02,100` – LazyBones Nov 13 '15 at 13:57
  • OK, the reason is actually quite simple, my bad! If we apply TRIM to the whole line, it won't affect each field. I changed my code. Sorry. – AntonyBrd Nov 13 '15 at 14:27
  • So I have to TRIM each and every column? `D = FOREACH C GENERATE TRIM(col1) AS col1T, TRIM(col2) AS col2T, TRIM(col3) AS col3T, TRIM(col4) AS col4T;` What if I have 200 columns? – LazyBones Nov 13 '15 at 14:39
  • Yes, you have to TRIM each field. If you have 200 columns, I suggest removing all spaces: `A2 = FOREACH A GENERATE REPLACE(line, ' ', '');` If you don't want to remove all spaces, write a UDF that takes a tuple of 200 elements and returns another tuple of 200 elements. – AntonyBrd Nov 13 '15 at 15:44
  • I think a UDF will be the better option, because with the replace approach a name containing a space, such as `Shahrukh Khan`, becomes `ShahrukhKhan`. Thanks @AntonyBrd – LazyBones Nov 13 '15 at 15:56
  • Hi @AntonyBrd, I wrote the Pig UDF from [here](http://stackoverflow.com/questions/29413674/applying-trim-in-pig-for-all-fields-in-a-tuple). I want to apply it to all fields now, so what should my `FOREACH` statement be? `trim = FOREACH xyz GENERATE package_name.class_name(??);` – LazyBones Nov 16 '15 at 22:36
  • Hi @LazyBones, you will first have to register your jar in your Pig script, then define your function: `REGISTER /path/to/jar/myJar.jar;` and `DEFINE STRTRIM packagename.classname();` But I don't think the UDF you wrote will give the result you expect. You should trim each value of your tuple, not only the first one (I suggest using a lambda expression or something like that instead of a for loop). – AntonyBrd Nov 17 '15 at 07:57
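
To illustrate the last comment's point about trimming every value rather than only the first, here is a minimal sketch of the per-field logic in plain Java, using a lambda instead of a for loop. It deliberately leaves out the Pig classes so that it runs on its own: `TrimFields` and `trimAll` are hypothetical names, and a `List<Object>` stands in for the Pig tuple. In a real UDF the same mapping would presumably sit inside the `exec` method of a class extending `org.apache.pig.EvalFunc`, reading from the input tuple and writing to a new one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Pig: it shows only the per-field logic.
final class TrimFields {

    // Trim every String field; nulls and non-String values pass through unchanged.
    // Spaces inside a value (e.g. between first and last name) are preserved.
    static List<Object> trimAll(List<Object> fields) {
        return fields.stream()
                .map(f -> f instanceof String ? ((String) f).trim() : f)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A List stands in for the Pig tuple here (assumption for illustration).
        List<Object> row = Arrays.asList("101", " hola ", " Shahrukh Khan ", null);
        System.out.println(trimAll(row)); // prints [101, hola, Shahrukh Khan, null]
    }
}
```

Because every element goes through the same mapping, the number of columns (4 or 200) makes no difference, and `Shahrukh Khan` keeps its inner space, unlike with the `REPLACE` approach.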