0

I have a array of string called tags in my hive table. I want to join each element from this array with a ',' delimiter and form a string. However, while joining I don't want to join the string if it doesn't include ':' character in it.

The reason I want to do this is, once string is joined, it should become, a:1,b:2,c:3. Then I want to apply str_to_map to make it a dictionary. Right now my str_to_map is failing if input doesn't include ':' in even of the elements.

Minimum input to fail: ["abc-xyz-1"]

SELECT
    CAST(SPLIT(hostname, '-')[1] AS BIGINT) AS host_id,
    str_to_map(concat_ws(',', tags)) AS tags,
    stack_trace
FROM test_events;

1 Answers1

0

I would suggest "exploding" the array data into separate rows, which allows use of an IF on each element inside the concat_ws. Like this (untested code):

SELECT
    CAST(SPLIT(hostname, '-')[1] AS BIGINT) AS host_id,
    str_to_map(concat_ws(',', 
        IF(instr(tag, ':') > 0, tag, NULL)
    )) AS tags,
    stack_trace
FROM test_events
LATERAL VIEW explode(tags) exploded_tags AS tag
GROUP BY CAST(SPLIT(hostname, '-')[1] AS BIGINT), stack_trace;

refs: lateral view & explode

I have to admit, not entirely sure if the group by is needed, but I have assumed it is.

I suggest experimentation with this snippet:

    SELECT hostname, stack_trace, tag
    FROM test_events
    LATERAL VIEW explode(tags) exploded_tags AS tag
    WHERE instr(tag, ':') > 0

Should avoid the tags without the ":", and if you change the where clause you could locate the tags that are non-compliant.

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • Hi Paul, this looks like a best answer from what I have come across. Thank you so much!! Just wanted to discuss about how expensive this query would be? – Sameer Shinde Apr 04 '23 at 05:26
  • Well it is going to cost some performance, but that is all I can honestly say. It is unlikely that it would be very noticeable, but I suspect that it will be the group by that would cost the most. Maybe you can do what you need to without grouping? (and that means dropping the concat_ws too of course). Not much more I can add or suggest. experiment with small a small sample perhaps. – Paul Maxwell Apr 04 '23 at 05:30