The following regexp_extract function appears to work in Impala, but does not work when I use it in Hive:

select regexp_extract("efwe FR wefwef", '.*?([[:upper:]]+).*?', 1)

The result in Impala is FR (as I would expect, i.e. the uppercase characters captured by the first group).

The result in Hive is e (not what I would expect).

Can anyone explain why this is?

From researching this issue, I have read that converting the regular expression to a Java-style regex may help (http://www.regexplanet.com/advanced/java/index.html). But as far as I know, a Java-style regex is the same as what I have.

dglozano
Tom1281

1 Answer

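I discovered the answer myself. Java does not support the POSIX bracket-expression syntax [[:upper:]], so in Hive I used [A-Z] rather than [:upper:]. Java parses [[:upper:]] as an ordinary character class containing the literal characters :, u, p, e and r, which is why Hive returned the first e of the input. Java's own spelling of the POSIX class is \p{Upper}.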

From the Impala documentation (https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_string_functions.html):

"In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation."

From the Hive language manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification):

"We use Java regex syntax. Try http://www.fileformat.info/tool/regex.htm for testing purposes."
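
The difference can be reproduced outside Hive with plain java.util.regex. The sketch below is only an approximation: the class name RegexDemo and the helper extract are hypothetical, and it assumes Hive's regexp_extract behaves like Pattern.compile followed by Matcher.find and group, which is consistent with the results reported in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    // Hypothetical stand-in for regexp_extract (an assumption, not Hive's actual code):
    // returns the requested group of the first match, or "" when nothing matches.
    static String extract(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String s = "efwe FR wefwef";

        // Java reads [[:upper:]] as a class of the literal characters ':', 'u', 'p', 'e', 'r',
        // so the group captures the leading 'e'.
        System.out.println(extract(s, ".*?([[:upper:]]+).*?", 1)); // prints "e"

        // An explicit range behaves as intended.
        System.out.println(extract(s, ".*?([A-Z]+).*?", 1)); // prints "FR"

        // Java's spelling of the POSIX uppercase class.
        System.out.println(extract(s, ".*?(\\p{Upper}+).*?", 1)); // prints "FR"
    }
}
```

In a Hive query the same two working patterns would be written as '.*?([A-Z]+).*?' or '.*?(\\p{Upper}+).*?', depending on how the client handles backslash escaping in string literals.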

Tom1281