0

i started working with hive to do some data preparation and ran into a peculiar problem when using the regexp_extract udf. I am working on a XML structures and i am trying to extract some elements from a XML-string. Here is an example. The string i am operating on is:

<b>ajsdnf</b>
<a>asdhf</a>
<a>alfnv</a>
<b>ajsdnf</b>
<a>test</a>

The regular expression (<a>.*?<\/a>) should extract all strings that contains only the elements with the a tags. When i check my logic on regex101 it finds all the right groups.

However when i run it against hive like this

select regexp_extract('<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(<a>.*?<\/a>)',0) from some_table limit 1;

it only returns the first <a>asdhf</a>. According to the documentation of regex_extract it should return all occurrences if i pass the integer 0 as 3rd parameter. Is there any chance i can achieve the following result

<a>asdhf</a>
<a>alfnv</a>
<a>test</a>

And if you are wondering why i am not using xpath to deal with this XML problem, i am having a much more complex structure and want to extract certain parts of the XML tree including all their child nodes. That is something the xpath udfs of hive cannot handle at the moment.

garkyn
  • 46
  • 6
  • What will happen if you remove `limit 1`? – Wiktor Stribiżew Jun 26 '15 at 09:56
  • 1
    regexp_extract doesn't work that way. It only matches once. Unless you know the number of matches, regexp_extract doesn't work. You might want to look at this: http://qnalist.com/questions/4957014/how-to-extract-multi-match-in-one-line-with-regexp-extract-function – nhahtdh Jun 26 '15 at 09:59
  • the answer in this thread [link](http://stackoverflow.com/questions/8497090/hive-regexp-extract-weirdness) suggest that an index of 0 should return all matched groups. Where am i wrong? – garkyn Jun 26 '15 at 10:15
  • @nhahtdh I followed according to the link, but it returns unmatched string.. – Prashanth Jul 24 '16 at 11:01
  • @Prashanth: Please ask a new question. I have no idea about your context, which might be different from the one in this question, so I have no way to give you any suggestion. – nhahtdh Jul 25 '16 at 02:35

1 Answers1

0
select regexp_replace('<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(.*?)(<a>.*?<\/a>)(.*?)','$2') from some_tablelimit 1;

this did the trick. Thanks to nhahtdh for his suggestions

garkyn
  • 46
  • 6