2

enter image description here

I have two columns id and segment. Segment is comma separated set of strings. I need to find average number of segments in all the table. One way to do it is by using two separate queries -

A - select count(*) from table_name;
B - select count(*) from table_name LATERAL VIEW explode(split(segment, ',') lTable AS singleSegment where segment != ""
avg = B/A

Answer would be 8/4 = 2 in the above case.

Is there a better way to achieve this ?

BlitzKrieg
  • 791
  • 8
  • 15

1 Answers1

2

Try:

select sum(CASE segment 
           WHEN '' THEN 0 
           ELSE  size(split(segment,','))
           END
           )*1.0/count(*) from table_name;

If your id field is unique, and you want to add a filter to the segment part, or protect against other malformed segment values like a,b, and a,,b, you could do:

SELECT SUM(seg_size)*1.0/count(*) FROM (
    SELECT count(*) as seg_size from table_name
    LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment
    WHERE trim(singleSegment) != ""
    GROUP BY id
) sizes

Then you can add other stuff into the where clause.

But this query takes two Hive jobs to run, compared to one for the simpler query, and requires the id field to be unique.

Thomas Andrews
  • 1,577
  • 1
  • 13
  • 31
  • The longer version of the query above worked fine. Thanks !! – BlitzKrieg Apr 22 '16 at 21:36
  • Yeah, I deleted the wrong first query, so you really mean the shorter query, now :) @BlitzKrieg The longer query is gonna be slower, but it does offer some more flexibility. – Thomas Andrews Apr 25 '16 at 14:35