Hive query counts of fields where fields are populated

Question

I have a huge Hive table consisting of ten product fields, date fields for the purchases, and an identifier. The product fields are named like prod1, prod2, ... , prod10 and refer to the last ten products purchased. For most IDs, we don't have purchase history all the way back to ten products.

I'd like to construct a distribution of population rates for each of the prod<X> fields, to show the breakdown of purchase history across the entire dataset.

Currently, I'm running a bash script that runs ten consecutive queries against the table like:

hive -e "select count(1) from db.tbl where prod<X> != '';"

... and saving the output to a file. This seems clunky and inefficient. Is there a better way to specify Hive counts on a range of fields with a range of field conditions? I've tried to come up with a strategy using groupby or even mapping a range of fields, but can't quite wrap my head around specifying the != '' condition for each field.

Thanks in advance for any direction.

score 1 · Accepted Answer · answered Oct 21 '15 at 13:08

select id,
sum(case when prod1='' then 0 else 1 end),
sum(case when prod2='' then 0 else 1 end),
sum(case when prod3='' then 0 else 1 end),
sum(case when prod4='' then 0 else 1 end),
sum(case when prod5='' then 0 else 1 end),
sum(case when prod6='' then 0 else 1 end),
sum(case when prod7='' then 0 else 1 end),
sum(case when prod8='' then 0 else 1 end),
sum(case when prod9='' then 0 else 1 end),
sum(case when prod10='' then 0 else 1 end)
from table group by id;

It all seems so simple in retrospect... Thanks a ton! – economy Oct 26 '15 at 19:58 — economy, Oct 26 '15 at 19:58

Hive query counts of fields where fields are populated

1 Answers1