4

I have a hive table with the column "periode", the type of the column is string.

The column have values like the following:

[{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
[{periode:20161212-20161217,nb:0}]

I want to cast this column in array<struct<periode:string, nb:int>>. The final goal is to have one raw by periode. For this I want to use lateral view with explode on the column periode. That's why I want to convert it to array<struct<string, int>>

Thanks for help. Sidi

3 Answers3

1

You don't need to "cast" anything, you just need to explode the array and then unpack the struct. I added an index to your data to make it more clear where things are ending up.

Data:

idx arr_of_structs
0   [{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
1   [{periode:20161212-20161217,nb:0}]

Query:

SELECT idx                          -- index
  , my_struct.periode AS periode    -- unpacks periode
  , my_struct.nb      AS nb         -- unpacks nb
FROM database.table
LATERAL VIEW EXPLODE(arr_of_structs) exptbl AS my_struct

Output:

idx     periode                 nb
0       20160118-20160205       1
0       20161130-20161130       1
0       20161130-20161221       1
1       20161212-20161217       0

It's a bit unclear from your question what the desired result is, but as soon as you update it I'll modify the query accordingly.


EDIT:

The above solution is incorrect, I didn't catch that your input is a STRING.

Query:

SELECT REGEXP_EXTRACT(tmp_arr[0], "([0-9]{8}-[0-9]{8})") AS periode
  , REGEXP_EXTRACT(tmp_arr[1], ":([0-9]*)")              AS nb
FROM (
  SELECT idx
    , pos
    , COLLECT_SET(tmp_col) AS tmp_arr
  FROM (
    SELECT idx
      , tmp_col
      , CASE WHEN PMOD(pos, 2) = 0 THEN pos+1 ELSE pos END AS pos
    FROM (
      SELECT *
        , ROW_NUMBER() OVER () AS idx
      FROM database.table ) x
    LATERAL VIEW POSEXPLODE(SPLIT(periode, ',')) exptbl AS pos, tmp_col ) y
  GROUP BY idx, pos) z

Output:

periode                 nb
20160118-20160205       1
20161130-20161130       1
20161130-20161221       1
20161212-20161217       0    
o-90
  • 17,045
  • 10
  • 39
  • 63
  • My Json is large - about compressed 600+ MB REGEXP_EXTRACT fails out of memory! – user 923227 Feb 15 '18 at 19:23
  • @SumitKumarGhosh is 600 MB large? – o-90 Feb 15 '18 at 21:53
  • Yes, we too thought the same thing but when the job is running it is getting Heap Errors. And we are not able to process the files. 600 MB is max size of one json compressed - this needs to be split - so this seems to be loaded as one String. Hence the failures! – user 923227 Feb 16 '18 at 02:25
0

What about use the split function? you should be able to do something like

select nb, period from 
(select split(periode, "-") as periods, nb from yourtable) t
LATERAL VIEW explode(periods) sss AS period;

I didnt tried but it should work :)

EDIT: the above should work if you have a column periodes following a pattern date-date-date.. and a column nb, but it looks like that it isn't the case here. The following query should work for you (verbose but work)

select period, nb from (
select 
regexp_replace(split(split(tok1,",")[1],":")[1], "[\\]|}]", "") as nb,
split(split(split(tok1,",")[0],":")[1],"-") as periods
from
(select split(YOURSTRINGCOLUMN, "},") as s1 from YOURTABLE) 
r1 LATERAL VIEW explode(s1) ss1 AS tok1
) r2 LATERAL VIEW explode(periods) ss2 AS period;
hlagos
  • 7,690
  • 3
  • 23
  • 41
  • `SPLIT(periode, "-")` makes no sense; `periode` is a column name inside an array of structs. – o-90 Jan 24 '17 at 21:12
  • I have a hive table with the column "periode", the type of the column is string.....he has two columns.. – hlagos Jan 24 '17 at 22:54
  • No, this won't work. The OP specifically states their data is of type `array>`. You cannot call `regex_replace()` or `split()` on an array. – o-90 Jan 25 '17 at 00:12
  • he doesn't have an array... his data is everyrhing is one strng, you are assuming that it is in an array because the data looks like an array. He is saying that we would like cast the string to a structure... wait for more clarification if you need :) – hlagos Jan 25 '17 at 00:30
  • no problem, your solution is cleaner but much less efficient if we are talking about a lot of data, the solution without window functions doesn't reduce phase, which is an advantage if he is working with a lot of data – hlagos Jan 25 '17 at 03:50
0

I realize this question is 1YO, but I ran into this same issue and tackled it by using the json_split brickhouse UDF.

SELECT EXPLODE(
    json_split(
        '[{"periode":"20160118-20160205","nb":1},{"periode":"20161130-20161130","nb":1},{"periode":"20161130-20161221","nb":1}]'
));

col
{"periode":"20160118-20160205","nb":1}
{"periode":"20161130-20161130","nb":1}
{"periode":"20161130-20161221","nb":1}

Sorry for the spaghetti code.

There's also a similar question here using JSON arrays instead of JSON strings. It's not the same case, but for anyone facing this kind of task it might be useful in a bigger context.

misterte
  • 977
  • 1
  • 11
  • 21