0

I have two sets of data coming from external source - purchase date of the customer and last email click/open date of the customer. This is stored in two tables PURCHASE_INTER AND ACTIVITY_INTER tables respectively. Purchase data is in multiple and I need to pick up last purchase date. But activity data is unique for each customer. The data is independent of each other and other set of data may not be present. We have written below query which combines two tables, groups them based on person_id which is id of the customer coming from external source and get the latest of the dates, join with our customer table to get the customer email, and again join with another table where this data will be ultimately stored for sake of knowing whether it is insert or update operation. Could you please suggest how I can improve the performance of this query. It is terribly slow and taking more than 10 hours. There are millions of records coming in PURCHASE_INTER AND ACTIVITY_INTER tables.

SELECT INTER.*, C.ID AS CUSTOMER_ID, C.EMAIL AS CUSTOMER_EMAIL, LSI.ID AS INTERACTION_ID, ROW_NUMBER() OVER (ORDER BY PERSON_ID ASC) AS RN FROM (
   SELECT PERSON_ID               AS PERSON_ID,
        MAX(LAST_CLICK_DATE)    AS LAST_CLICK_DATE,
        MAX(LAST_OPEN_DATE)     AS LAST_OPEN_DATE,
        MAX(LAST_PURCHASE_DATE) AS LAST_PURCHASE_DATE
   FROM (
     SELECT ACT.PERSON_ID AS PERSON_ID,
          ACT.LAST_CLICK_DATE AS LAST_CLICK_DATE,
          ACT.LAST_OPEN_DATE AS LAST_OPEN_DATE,
          NULL AS LAST_PURCHASE_DATE
     FROM ACTIVITY_INTER ACT
     WHERE ACT.JOB_ID = 77318317
     UNION
     SELECT PUR.PERSON_ID AS PERSON_ID,
          NULL AS LAST_CLICK_DATE,
          NULL AS LAST_OPEN_DATE,
          PUR.LAST_PURCHASE_DATE AS LAST_PURCHASE_DATE
     FROM PURCHASE_INTER PUR
     WHERE PUR.JOB_ID = 77318317
   ) GROUP BY PERSON_ID
 ) INTER LEFT JOIN CUSTOMER C ON INTER.PERSON_ID = C.PERSON_ID
         LEFT JOIN INTERACTION LSI ON C.ID = LSI.CUSTOMER_ID;
Sammy Pawar
  • 1,201
  • 3
  • 19
  • 38
  • Do you need to remove duplicates, or can you use `UNION ALL` instead of `UNION`? – jarlh Mar 30 '16 at 14:19
  • How many records match a given job? – Gordon Linoff Mar 30 '16 at 14:24
  • Do you really need to provide the `RN` column? That could be expensive to compute if you are returning a large number of rows. – John Bollinger Mar 30 '16 at 14:42
  • RN column is for pagination. This query will feed into next step of a batch which uses partition and RN is used to select the range. LS_ACTIVITY_INTER has 4662610 and LS_PURCHASE_INTER has 7730694 records. There may not be duplicates in the table for same job_id. Duplicate person_id exists in LS_PURCHASE_INTER but dates are different. – Sammy Pawar Mar 30 '16 at 15:00
  • @SammyPawar -- if you are going to do pagination you might want to calculate the page needed first before doing all the joining to get the data -- this could speed up the process. – Hogan Mar 30 '16 at 15:10
  • @Hogan, thanks. There is a different query which calculates total number of records and partitioner calculates divides it by some fixed number to process the records in parallel. But output needs to feed the rownumber for selecting the records. Above is just a select query. There is a where clause on top of this which restricts records based on rownumber like below `RN > :start AND RN <= :end` – Sammy Pawar Mar 30 '16 at 15:37
  • @SammyPawar -- my point is you don't need to find out (for example) "last click date" when you are selecting which page of person ids to use. Pick the page of person ids first then join to the table that tells you last click date. THIS WILL BE FASTER. Probably a lot faster if your person table, or the table that has last click dates are big. – Hogan Mar 30 '16 at 15:41

1 Answers1

5

Your query suggests the following indexes:

  • ACTIVITY_INTER(JOB_ID, PERSON_ID, LAST_CLICK_DATE, LAST_OPEN_DATE)
  • PURCHASE_INTER(JOB_ID, PERSON_ID, LAST_PURCHASE_DATE)
  • CUSTOMER(PERSON_ID)
  • INTERACTION(CUSTOMER_ID)

(For the first two indexes, the first column is more important than the other two, unless the number of matches is very large.)

Also, change the UNION to UNION ALL. UNION incurs overhead for removing duplicates -- and this is not possible (at least between the two subqueries) because each subquery returns different columns.

In addition, you might want to replace the first subquery with a full outer join:

SELECT COALESCE(a.PERSON_ID, p.PERSON_ID) as PERSON_ID,
       a.LAST_CLICK_DATE, a.LAST_OPEN_DATE,p.LAST_PURCHASE_DATE
FROM (SELECT ACT.PERSON_ID AS PERSON_ID,
             MAX(ACT.LAST_CLICK_DATE) AS LAST_CLICK_DATE,
             MAX(ACT.LAST_OPEN_DATE) AS LAST_OPEN_DATE
      FROM ACTIVITY_INTER ACT
      WHERE ACT.JOB_ID = 77318317
      GROUP BY ACT.PERSON_ID
     ) a FULL OUTER JOIN
     (SELECT PUR.PERSON_ID AS PERSON_ID,
             MAX(PUR.LAST_PURCHASE_DATE) AS LAST_PURCHASE_DATE
      FROM PURCHASE_INTER PUR
      WHERE PUR.JOB_ID = 77318317
      GROUP BY PER.PERSON_ID
     ) p
     ON a.PERSON_ID = p.PERSON_ID

This gives Oracle more options for optimization, because the aggregation is done directly on the tables -- making indexes and better statistics available for the processing.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786