2

I have a table with two columns: id, datastring

The id column is just a bigint and the datastring column has elements that look like

{"12345":[6789,true]}

{"1234678":[5678, false]}

I would like to select a table where the first column is the id and the second column is the number in the quotes part of the datastring. However, this number is not always the same number of digits.

The result should be

id, numstring

4321, 12345
4322, 134678

Thanks in advance.

Parsa
  • 1,137
  • 1
  • 11
  • 15

1 Answers1

2

You have at least two options. The first one is slow and clean, the second one is quick and dirty.

Let's start with the second option - built-in Hive function regexp_extract:

SELECT id, regexp_extract(datastring, '^\\{"(\\d+)".*$', 1) AS numstring 
  FROM your_table

Done!

regexp_extract takes three arguments: string, regexp, and group index. If regexp matches the string then regexp_extract returns regexp group specified by the group index.

The first option is to write your own UDF in Java that'll extract numstring from datastring using Jackson or whatever. I would go with the regexp_extract. But that's me.

Nigel Tufnel
  • 11,146
  • 4
  • 35
  • 31