1

I have a HUGE table that contains details about all customers, such as their phone numbers and email addresses. The table can have multiple rows for each customer, indicating that at some point they have changed their phone number, email address, or other profile information. Every row has a day column, indicating the day that the profile change happened (i.e. the state of their profile on that day).

I want to extract every phone number and email address that each customer has ever had on their profile, and also add a boolean flag (called 'live') indicating which phone number and email is currently associated with their profile (i.e. the most recent record we have in the table for that customer).

Here is my current query:

SELECT DISTINCT
    customer_id,
    phone_number,
    email AS email_address,
    CASE
        WHEN day = (
            SELECT
                MAX(temp.day)
            FROM
                customer AS temp
            WHERE
                customer.customer_id = temp.customer_id
        )
        THEN true
        ELSE false
    END AS live
FROM
    customer

I assume this is terribly inefficient given that there are billions of rows for millions of different customers. How can I impove this query to achieve what I want (maybe a window function?), or would soem completely different way better achieve what I want?

KOB
  • 4,084
  • 9
  • 44
  • 88
  • Please clarify: if a customer had first phone_number `A` than changed it to `B` and finally revert it to `A` again (email remains the same), do you expect to see three row for this customer or only two? – Marmite Bomber Sep 06 '20 at 17:46
  • @MarmiteBomber yes, there would be 3 rows – KOB Sep 07 '20 at 08:03
  • Fine @KOB, if you want to preserve all rows, you should *remove* the `DISTINCT` from your query... – Marmite Bomber Sep 07 '20 at 10:04

4 Answers4

1

You can use row_number() to set the flag:

SELECT 
    customer_id,
    phone_number,
    email AS email_address,
    (ROW_NUMBER() over(partition by customer_id ORDER BY day DESC) = 1) as is_live
FROM customer

I am unsure whether Presto understands a condition as a boolean value - if not, you can do:

CASE WHEN ROW_NUMBER() over(partition by customer_id ORDER BY day DESC) = 1
    THEN true
    ELSE false
END as is_live
GMB
  • 216,147
  • 25
  • 84
  • 135
  • I prefer this answer since it doesn't matter if the rows are unique or not, as far as I can tell. Would there be any benefit to using RANK() instead of ROW_NUMBER()? – KOB Sep 06 '20 at 17:01
  • @KOB: `rank()` is there to allow ties (that is, if you have several records per `customer_id` and `day`). If there are no ties, both `ROW_NUMBER()` and `RANK()` are identical. – GMB Sep 06 '20 at 17:03
  • This query fails to handle the case when the *phone_number* is reused, i.e. A -> B -> A will produce 3 rows, not two B + A(live) as expected. – Marmite Bomber Sep 06 '20 at 17:36
  • 1
    @MarmiteBomber: OP does not want to filter rows, they want to flag them. Besides, `row_number()` gives only one `1` per `customer_id`. – GMB Sep 06 '20 at 17:41
  • 1
    @MarmiteBomber: *I want to extract every phone number and email address that each customer has ever had on their profile, and also add a boolean flag (called 'live') indicating which phone number and email is currently associated with their profile* sounds clear to me – dnoeth Sep 06 '20 at 17:58
  • @dnoeth if in a versioned table the validity timestamp is removed (which the query does) it is questionable if you want to see the *same resource several times* if it was used repeatedly in the past - see the sample data in my alternative answer – Marmite Bomber Sep 06 '20 at 18:36
1

If the records are unique, you can use:

SELECT customer_id, phone_number, email AS email_address,
       (CASE WHEN RANK() OVER (PARTITION BY customer_id ORDER BY day DESC) = 1
             THEN true ELSE false
        END) as is_live
FROM customer c;

Presto supports booleans, so you don't need a CASE expression. But I have kept it because you have it in your code.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • They are unique when the 'day' column is included, since there is at most one record per day, per customer. Is this ok? – KOB Sep 06 '20 at 16:58
  • Can you explain why I would use RANK() over ROW_NUMBER()? – KOB Sep 06 '20 at 17:02
  • @KOB . . . They are probably equivalent in this case. However, the phrasing of the question suggested to me that you could have ties in the data. If there are multiple records for a customer with the most recent date, then all are flagged `true`, because you did not specify how to break ties. – Gordon Linoff Sep 07 '20 at 02:14
0

This is your initial Correlated Subquery translated into a Group Max, same result as the RANK version:

CASE
    WHEN day = MAX(temp.day) over (partition by customer_id)
    THEN true
    ELSE false
END AS live
dnoeth
  • 59,503
  • 4
  • 39
  • 56
0

To show every phone number and email address that each customer has ever had on their profile with the meaning every combination of those two resource only once use the following query.

Here the sample date

create table tab as
select 1 customer_id, 'A' phone_number, '1@1' email, DATE'2020-01-01' day  from dual union all
select 1 customer_id, 'B' phone_number, '1@1' email, DATE'2020-02-01' day  from dual union all
select 1 customer_id, 'A' phone_number, '1@1' email, DATE'2020-03-01' day  from dual union all
select 2 customer_id, 'C' phone_number, '1@1' email, DATE'2020-01-01' day  from dual union all
select 2 customer_id, 'C' phone_number, '1@1' email, DATE'2020-04-01' day  from dual 
;

Query

with live as (
select CUSTOMER_ID, PHONE_NUMBER, EMAIL, DAY ,
case when row_number() over (partition by CUSTOMER_ID order by day DESC) = 1 then 1 else 0 end is_live
from tab) 
select   CUSTOMER_ID, PHONE_NUMBER, EMAIL, 
case when max(is_live) = 1 then 'true' else 'false' end is_life
from live
group by CUSTOMER_ID, PHONE_NUMBER, EMAIL

which produce

CUSTOMER_ID P EMA IS_LI
----------- - --- -----
          1 B 1@x false
          1 A 1@x true 
          2 C 2@x true 

The subquery identifies the last (live) record, the GROUP BY query produce only the unique combination of the resources (similar, but better that SELECT DISTINCTin your solution - that fails if the same resources are in a live and non live row).

Marmite Bomber
  • 19,886
  • 4
  • 26
  • 53