I have a huge Hive table consisting of ten product fields, date fields for the purchases, and an identifier. The product fields are named like prod1
, prod2
, ... , prod10
and refer to the last ten products purchased. For most IDs, we don't have purchase history all the way back to ten products.
I'd like to construct a distribution of population rates for each of the prod<X>
fields, to show the breakdown of purchase history across the entire dataset.
Currently, I'm running a bash script that runs ten consecutive queries against the table like:
hive -e "select count(1) from db.tbl where prod<X> != '';"
... and saving the output to a file. This seems clunky and inefficient. Is there a better way to specify Hive counts on a range of fields with a range of field conditions? I've tried to come up with a strategy using groupby or even mapping a range of fields, but can't quite wrap my head around specifying the != ''
condition for each field.
Thanks in advance for any direction.