1

I’ve to generate a report that will give me the sum of the counts from tables A, B and C for events that have been stored using Hive and my S3 buckets have been partitioned by Organization_id

For eg: Table A – Has a record for every day John (and other employees) goes to work Table B – Has a record for every call that John (and other employees) makes or takes at work Table C – Has a record for every expense that John(and other employees) submits at work

Basically I want a sum of the counts from A, B and C for John (employee_id) in the last month. There should be only one record for every date if there is a record in any of the 3 tables A, B or C (and sum the counts if there is a record for a date in one or more of the tables). So my Output is:

Employee id
Employee Name 
Date
Count
123
John
02-Jan-2016
55
123
John
12-Jan-2016
88
123
John
19-Jan-2016
103

The query that I came up with is:

select  adcts.employee_name, adcts.employee_id,Total_count as event_count, adcts.event_date  
from   
       (select   coalesce(Evts.employee_id,imps.employee_id,AEvts.employee_id) as   employee_id  
        ,   coalesce(Evts.employee_name,imps.employee_name,AEvts.employee_name) as   employee_name  
        , coalesce(Evts.Event_count,0) + coalesce(Imps.Impression_count,0)   + coalesce (AEvts.Event_Count,0)as Total_Count  
        , coalesce (Evts.event_date,imps.impression_date, AEvts.event_date)   as event_date  
    from  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,event_date) Evts  
       full outer join  
        (select employee_id, employee_name, count(*) as Impression_count,   impression_date   
         from mm_impressions  
         where organization_id = 100048  
         and impression_date between '2016-02-01' and '2016-02-04'  
        group by employee_id, employee_name,impression_date) Imps  
        on Evts.employee_id = Imps.employee_id  
       full outer join  
        (select employee_id, employee_name, count(*) as   Event_count,event_date  
         from mm_attributed_events  
         where organization_id = 100048  
         and event_date between '2016-02-01' and '2016-02-04'  
         and event_type = 'click'  
        group by employee_id, employee_name,event_date) AEvts  
     on AEvts.employee_id=Evts.employee_id  
       ) adcts     
join  
        (select distinct c.employee_id from default.t1_meta_dmp c   
         where c.employee_dmp_enabled='inherits'  
         and c.agency_dmp_enabled = 'inherits'  
         and c.agency_status='true'  
         and c.employee_status='true'  
         and c.organization_id = 100048) cc  
on adcts.employee_id=cc.employee_id  
order by adcts.employee_id asc  

I have 2 questions:

1. Do I have the right query? 2. Because I’m using ‘full outer join’ I get more than one entry for the same date. Can someone suggest a better way to achieve the result? Different query maybe

Alex
  • 21,273
  • 10
  • 61
  • 73
Ajay
  • 11
  • 2

1 Answers1

0

You are getting more than one entry for the same date because you are grouping by date in subqueries but joining them only by employee_id. That is why your records are duplicated after join. You should add event_date to the join condition as well.

It seems you do not need FULL JOIN at all. Join is more expensive than union all. Use UNION ALL select from each table then group by employee_name, employee_id, event_date and aggregate count() :

select employee_id, employee_name, sum(Event_count) as Total_Count , event_date 
    from
    (
    select employee_id, employee_name, count(*) as Event_count, event_date  from mm_events 
    where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name, event_date

    union all  
    select employee_id, employee_name, count(*) as Event_count, impression_date as event_date   
    from mm_impressions
     where organization_id = 100048 and impression_date between '2016-02-01' and '2016-02-04' 
group by employee_id, employee_name,impression_date

    union all 
    select employee_id, employee_name, count(*) as Event_count,event_date  
    from mm_attributed_events 
    where organization_id = 100048  and event_date between '2016-02-01' and '2016-02-04'  and event_type = 'click'
group by employee_id, employee_name, event_date
    ) adcts
    group by employee_id, employee_name, event_date

Add your join with cc query to the above query.

All subqueries in UNION ALL will run in parallel

leftjoin
  • 36,950
  • 8
  • 57
  • 116