
I'm trying to write a Pig script that takes string data like abc|def|xyz and puts these values into an array of strings.

How do I split this string to get an array of strings like [abc,def,xyz]?

I tried using the STRSPLIT function, but the number of splits in my case is not fixed. The number of pipe-separated values can vary, and I need all of those values to be in the array.

Any suggestions?

user3335722

2 Answers


You were on the right track, but there is one thing about STRSPLIT you didn't notice: you can also use it when the number of splits is not fixed. The third argument to that UDF is the limit on the number of splits, but you can pass a negative number and it will return every split that matches your expression.

From the official documentation for STRSPLIT:

limit

If the value is positive, the pattern (the compiled representation of the regular expression) is applied at most limit-1 times, therefore the value of the argument means the maximum length of the result tuple. The last element of the result tuple will contain all input after the last match.

If the value is negative, no limit is applied for the length of the result tuple.
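
The limit rule quoted above can be illustrated outside Pig. The following is only a Python sketch that mimics the documented behaviour for positive and negative limits; the helper name `strsplit` is hypothetical, and it is not Pig's actual implementation.

```python
import re

def strsplit(source, regex, limit):
    """Hypothetical helper mimicking the documented STRSPLIT limit rule."""
    if limit == 0:
        # The quoted documentation only covers positive and negative limits.
        raise ValueError("limit 0 is not covered by this sketch")
    if limit == 1:
        # The pattern is applied limit-1 = 0 times: the input comes back whole.
        return (source,)
    if limit > 1:
        # The pattern is applied at most limit-1 times; the last element
        # holds all input after the last match.
        return tuple(re.split(regex, source, maxsplit=limit - 1))
    # Negative limit: no cap on the length of the result tuple.
    return tuple(re.split(regex, source))

print(strsplit("abc|def|xyz", r"\|", 2))   # ('abc', 'def|xyz')
print(strsplit("abc|def|xyz", r"\|", -1))  # ('abc', 'def', 'xyz')
```

With a positive limit the tail of the string stays unsplit, which is why a fixed limit does not work for input with a varying number of fields, while a negative limit returns every field.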

Imagine this input:

abc|def|xyz,1
abc|def|xyz|abc|def|xyz,2

You can do the following:

A = load 'data.txt' using PigStorage(',');
B = foreach A generate STRSPLIT($0,'\\|',-1);

And the output will be:

DUMP B;

((abc,def,xyz))
((abc,def,xyz,abc,def,xyz))
Balduz
  • I have one more problem. The result I'm getting is of STRUCT type, but I want to store this data in a Hive table in an ARRAY type variable. This causes a type mismatch. Is there any solution for this? – user3335722 Jul 23 '15 at 03:44
  • You can try STRSPLITTOBAG https://pig.apache.org/docs/latest/func.html#strsplittobag – deepkimo Jul 17 '18 at 21:21

Another feasible option is to use TOKENIZE, although I would suggest going with the solution from @Balduz.

A = load 'data.txt' using PigStorage(',');
B = foreach A generate BagToString(TOKENIZE($0,'|'),',');
DUMP B;

Output of DUMP B:

(abc,def,xyz)
(abc,def,xyz,abc,def,xyz)
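
One thing to keep in mind with this approach is that the joined result is a single string, not a collection. The Python sketch below mimics the two steps under the assumption that the tokenizing step splits on the delimiter and drops empty tokens; the function name `tokenize_then_join` is hypothetical and this is not Pig code.

```python
def tokenize_then_join(source, field_delim="|", out_delim=","):
    """Hypothetical stand-in for BagToString(TOKENIZE(...), ',')."""
    # TOKENIZE-like step: split on the delimiter, dropping empty tokens
    # (assumed behaviour for this sketch).
    tokens = [t for t in source.split(field_delim) if t]
    # BagToString-like step: join everything back into one string.
    return out_delim.join(tokens)

result = tokenize_then_join("abc|def|xyz")
print(result)        # abc,def,xyz
print(type(result))  # <class 'str'>
```

The commas in the dumped output are therefore part of one string value rather than separators between fields.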
Murali Rao
  • Thanks, Murali. The answer that Balduz gave works, but the result I'm getting is of STRUCT type and I want to store it in a Hive table as an ARRAY. This causes a type mismatch. Is there any solution for this? – user3335722 Jul 23 '15 at 03:46