56

I am looking for a in-built String split function in Hive? e.g. if String is:

A|B|C|D|E

Then I want to have a function like:

array<string> split(string input, char delimiter)

So that I get back:

[A,B,C,D,E]

Does such a in-built split function exist in Hive.

I can only see regexp_extract and regexp_replace. I would love to see a indexOf() and split() string functions.

Gerard Roche
  • 6,162
  • 4
  • 43
  • 69
user855
  • 19,048
  • 38
  • 98
  • 162

3 Answers3

67

There does exist a split function based on regular expressions. It's not listed in the tutorial, but it is listed on the language manual on the wiki:

split(string str, string pat)
   Split str around pat (pat is a regular expression) 

In your case, the delimiter "|" has a special meaning as a regular expression, so it should be referred to as "\\|".

TobiSH
  • 2,833
  • 3
  • 23
  • 33
Bkkbrad
  • 3,087
  • 24
  • 30
  • 1
    semicolon is not in the list of regex metacharacters (see, e.g., https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php). Thus, you should just be able to use ';' instead of '\\;'. – Zak Kann Jul 01 '16 at 19:47
  • The page does not exist anymore: https://cwiki.apache.org/confluence/display/HADOOP2/Hive/LanguageManual/UDF#String_Functions – thegreatcoder Nov 13 '19 at 00:06
  • up to date Language Manual link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions – Mihai Morcov Nov 28 '19 at 12:02
53

Another interesting usecase for split in Hive is when, for example, a column ipname in the table has a value "abc11.def.ghft.com" and you want to pull "abc11" out:

SELECT split(ipname,'[\.]')[0] FROM tablename;
TayTay
  • 6,882
  • 4
  • 44
  • 65
Sanjay Subramanian
  • 1,509
  • 14
  • 9
  • 1
    is the split can use in `GROUP BY` like `select split(area,'[_]')[0],isp,pc_mobile,device,count(userip) from usemap_without_ptime ORDER BY split(area,'[_]')[0],isp,pc_mobile,device` – timger Dec 21 '12 at 08:19
  • 2
    Actually you have to put the pattern in a regex. should be SELECT split(ipname,'\\.')[0] FROM tablename – malintha Oct 20 '14 at 07:22
13

Just a clarification on the answer given by Bkkbrad.

I tried this suggestion and it did not work for me.

For example,

split('aa|bb','\\|')

produced:

["","a","a","|","b","b",""]

But,

split('aa|bb','[|]')

produced the desired result:

["aa","bb"]

Including the metacharacter '|' inside the square brackets causes it to be interpreted literally, as intended, rather than as a metacharacter.

For elaboration of this behaviour of regexp, see: http://www.regular-expressions.info/charclass.html

spelunk1
  • 707
  • 6
  • 6