Group by on single column and select multiple columns in hive

Question

I have data in a Hive table as given below.

Table 1:

ID CODE VALUE  

1  XXXX 100  
2  AAAA 200
1  YYYY 300  
3  DDDD 300  
4  BBBB 200  
2  CCCC 300  
3  HHHH 200

I have to select ID, VALUE from Table 1 and load into Table 2.Since Table 1 has duplicate ID's, i have to get unique rows from Table 1 based on the data in CODE Field.

order of selection of rows from table 1 is first priority to be given to CCCC, then to XXXX, then to AAAA and so on.

I am expecting Table 2 to have data like below-

Group by wont help you here. Consider using Windowing to select 1 row per CODE. Regarding ordering, not sure if this can be implemented using available functions in hive. — kalpesh, Feb 13 '20 at 06:04
I am thinking of a solution which i have to still try to filter the source table into 6 different tables based on the code field and then use a case to populate from first table else from 2nd table and so on — Surbhi Jain, Feb 13 '20 at 06:26
Windowing is best suited for this case as you have to group on single column while selecting multiple columns. I have a suggestion, after you derive 1 row per CODE, derive column named "priority" which will have numeric value. You can write your logic while deriving this column like if(CODE=='CCCC') then priority=1, if(CODE=='XXXX' then priority=1 and so on. You can use this new column for further logic. — kalpesh, Feb 13 '20 at 06:40
Have already tried this approach to have ranks/priorities set up but since same table satisfies all proirities, i get duplicates in my target. — Surbhi Jain, Feb 13 '20 at 06:44

leftjoin · Accepted Answer · 2020-02-13T08:23:35.003

You can define priority in the CASE statement. Assign priorities to each CODE or list of codes. Priority calculated in the CASE will be used for ordering inside ID partition before numbering rows by row_number analityc function. Rows with the same ID and same priority will be ordered randomly, so better try to define priorities for all codes. See comments in the code:

with your_data as ( --Use your table instead of this CTE
select stack(7,
1,'XXXX',100, 
2,'AAAA',200,
1,'YYYY',300, 
3,'DDDD',300, 
4,'BBBB',200, 
2,'CCCC',300, 
3,'HHHH',200
) as (ID,CODE,VALUE)
)

select ID, VALUE 
from
(
select ID,CODE,VALUE,
       row_number() over(partition by ID order by case when CODE='CCCC' then 1 --define priority here
                                                       when CODE='XXXX' then 2
                                                       when CODE='AAAA' then 3
                                                       when CODE='DDDD' then 4
                                                       when CODE='HHHH' then 5
                                                       --and so on... add more cases
                                                       else 99  --default priority
                                                    end) rn
  from your_data d
)s
where rn=1 --remove duplicates  
;

Result:

OK
1       100
2       300
3       300
4       200

Exactly..@Surbhi Jain This is what, I was saying in comments. You can accept this as answer. — kalpesh, Feb 13 '20 at 08:13

Group by on single column and select multiple columns in hive

1 Answers1