Count the number of transactions per month for an individual group by date Hive

Question

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date. There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.

The data looks like below

Id          visit_date     Cal_month_nbr
----        ------          ------
1           01/01/2020      1
1           01/02/2020      1
1           01/01/2020      1
2           02/01/2020      2
1           02/01/2020      2
1           03/01/2020      3
3           03/01/2020      3

first I want to know how many times customer visits per month using their visit_date i.e i want below output

id    cal_month_nbr       visit_per_month
---        ---------     ----
1           1             2
1           2             1
1           3             1
2           2             1
3           3             1

and what is the avg frequency of visit per ids ie.

id            Avg_freq_per_month
----          -------------
1              1.33
2              1
3              1

I tried with below query but it counts each item as one transaction

select avg(count_e) as num_visits_per_month,individual_id
from
(
    select r.individual_id, cal_month_nbr, count(*) as count_e
 from 
  ww_customer_dl_secure.cust_scan 
         GROUP  by 
         r.individual_id, cal_month_nbr
         order by count_e desc
         ) as t
         group by individual_id

I would appreciate any help, guidance or suggestions

How is 1.33 calculated for id 1? There are five visits over three months, which seems like 1.67 to me. — Gordon Linoff, Apr 13 '20 at 20:03
@GordonLinoff Each row in the table is not a visit. Each entry is an item purchased for a transaction. Hence, for id 1, same visit_date i.e 01/01/2020 appears twice in the table which is only 1 transaction but two items purchased — krishna koti, Apr 13 '20 at 20:08

Gordon Linoff · Answer 1 · 2020-04-13T20:08:06.657

1

You can divide the total visits by the number of months:

select individual_id,
       count(*) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;

If you want the average number of days per month, then:

select individual_id,
       count(distinct visit_date) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;

Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:

select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
      from (select distinct individual_id, visit_date, cal_month_nbr
            from ww_customer_dl_secure.cust_scan c
           ) iv 
      group by individual_id, cal_month_nbr
     ) ic
group by individual_id;

edited Apr 13 '20 at 20:08

answered Apr 13 '20 at 20:02

Gordon Linoff

1,242,037
58
646
786

the above Hive query works if each row is a unique transaction, but there can be multiple rows for a single visit, so visit_date tells us that for id 1 there are total 4 transactions or 4 visits – krishna koti Apr 13 '20 at 20:16
@krishnakoti . . . According to your question and sample data, you seem to be counting distinct dates, not visits. – Gordon Linoff Apr 13 '20 at 22:06

Count the number of transactions per month for an individual group by date Hive

1 Answers1