1

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date. There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.

The data looks like below

Id          visit_date     Cal_month_nbr
----        ------          ------
1           01/01/2020      1
1           01/02/2020      1
1           01/01/2020      1
2           02/01/2020      2
1           02/01/2020      2
1           03/01/2020      3
3           03/01/2020      3

first I want to know how many times customer visits per month using their visit_date i.e i want below output

id    cal_month_nbr       visit_per_month
---        ---------     ----
1           1             2
1           2             1
1           3             1
2           2             1
3           3             1

and what is the avg frequency of visit per ids ie.

id            Avg_freq_per_month
----          -------------
1              1.33
2              1
3              1

I tried with below query but it counts each item as one transaction

select avg(count_e) as num_visits_per_month,individual_id
from
(
    select r.individual_id, cal_month_nbr, count(*) as count_e
 from 
  ww_customer_dl_secure.cust_scan 
         GROUP  by 
         r.individual_id, cal_month_nbr
         order by count_e desc
         ) as t
         group by individual_id

I would appreciate any help, guidance or suggestions

user4157124
  • 2,809
  • 13
  • 27
  • 42
krishna koti
  • 659
  • 1
  • 6
  • 10
  • How is 1.33 calculated for id 1? There are five visits over three months, which seems like 1.67 to me. – Gordon Linoff Apr 13 '20 at 20:03
  • @GordonLinoff Each row in the table is not a visit. Each entry is an item purchased for a transaction. Hence, for id 1, same visit_date i.e 01/01/2020 appears twice in the table which is only 1 transaction but two items purchased – krishna koti Apr 13 '20 at 20:08

1 Answers1

1

You can divide the total visits by the number of months:

select individual_id,
       count(*) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;

If you want the average number of days per month, then:

select individual_id,
       count(distinct visit_date) / count(distinct cal_month_nbr)
from  ww_customer_dl_secure.cust_scan c
group by individual_id;

Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:

select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
      from (select distinct individual_id, visit_date, cal_month_nbr
            from ww_customer_dl_secure.cust_scan c
           ) iv 
      group by individual_id, cal_month_nbr
     ) ic
group by individual_id;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • the above Hive query works if each row is a unique transaction, but there can be multiple rows for a single visit, so visit_date tells us that for id 1 there are total 4 transactions or 4 visits – krishna koti Apr 13 '20 at 20:16
  • @krishnakoti . . . According to your question and sample data, you seem to be counting distinct dates, not visits. – Gordon Linoff Apr 13 '20 at 22:06