Improve the performance

Question

I have two sets of data coming from external source - purchase date of the customer and last email click/open date of the customer. This is stored in two tables PURCHASE_INTER AND ACTIVITY_INTER tables respectively. Purchase data is in multiple and I need to pick up last purchase date. But activity data is unique for each customer. The data is independent of each other and other set of data may not be present. We have written below query which combines two tables, groups them based on person_id which is id of the customer coming from external source and get the latest of the dates, join with our customer table to get the customer email, and again join with another table where this data will be ultimately stored for sake of knowing whether it is insert or update operation. Could you please suggest how I can improve the performance of this query. It is terribly slow and taking more than 10 hours. There are millions of records coming in PURCHASE_INTER AND ACTIVITY_INTER tables.

SELECT INTER.*, C.ID AS CUSTOMER_ID, C.EMAIL AS CUSTOMER_EMAIL, LSI.ID AS INTERACTION_ID, ROW_NUMBER() OVER (ORDER BY PERSON_ID ASC) AS RN FROM (
   SELECT PERSON_ID               AS PERSON_ID,
        MAX(LAST_CLICK_DATE)    AS LAST_CLICK_DATE,
        MAX(LAST_OPEN_DATE)     AS LAST_OPEN_DATE,
        MAX(LAST_PURCHASE_DATE) AS LAST_PURCHASE_DATE
   FROM (
     SELECT ACT.PERSON_ID AS PERSON_ID,
          ACT.LAST_CLICK_DATE AS LAST_CLICK_DATE,
          ACT.LAST_OPEN_DATE AS LAST_OPEN_DATE,
          NULL AS LAST_PURCHASE_DATE
     FROM ACTIVITY_INTER ACT
     WHERE ACT.JOB_ID = 77318317
     UNION
     SELECT PUR.PERSON_ID AS PERSON_ID,
          NULL AS LAST_CLICK_DATE,
          NULL AS LAST_OPEN_DATE,
          PUR.LAST_PURCHASE_DATE AS LAST_PURCHASE_DATE
     FROM PURCHASE_INTER PUR
     WHERE PUR.JOB_ID = 77318317
   ) GROUP BY PERSON_ID
 ) INTER LEFT JOIN CUSTOMER C ON INTER.PERSON_ID = C.PERSON_ID
         LEFT JOIN INTERACTION LSI ON C.ID = LSI.CUSTOMER_ID;

Do you need to remove duplicates, or can you use `UNION ALL` instead of `UNION`? — jarlh, Mar 30 '16 at 14:19
Do you really need to provide the `RN` column? That could be expensive to compute if you are returning a large number of rows. — John Bollinger, Mar 30 '16 at 14:42
RN column is for pagination. This query will feed into next step of a batch which uses partition and RN is used to select the range. LS_ACTIVITY_INTER has 4662610 and LS_PURCHASE_INTER has 7730694 records. There may not be duplicates in the table for same job_id. Duplicate person_id exists in LS_PURCHASE_INTER but dates are different. — Sammy Pawar, Mar 30 '16 at 15:00
@SammyPawar -- if you are going to do pagination you might want to calculate the page needed first before doing all the joining to get the data -- this could speed up the process. — Hogan, Mar 30 '16 at 15:10
@Hogan, thanks. There is a different query which calculates total number of records and partitioner calculates divides it by some fixed number to process the records in parallel. But output needs to feed the rownumber for selecting the records. Above is just a select query. There is a where clause on top of this which restricts records based on rownumber like below `RN > :start AND RN <= :end` — Sammy Pawar, Mar 30 '16 at 15:37
@SammyPawar -- my point is you don't need to find out (for example) "last click date" when you are selecting which page of person ids to use. Pick the page of person ids first then join to the table that tells you last click date. THIS WILL BE FASTER. Probably a lot faster if your person table, or the table that has last click dates are big. — Hogan, Mar 30 '16 at 15:41

Gordon Linoff · Accepted Answer · 2016-03-31T01:54:08.493

Your query suggests the following indexes:

ACTIVITY_INTER(JOB_ID, PERSON_ID, LAST_CLICK_DATE, LAST_OPEN_DATE)
PURCHASE_INTER(JOB_ID, PERSON_ID, LAST_PURCHASE_DATE)
CUSTOMER(PERSON_ID)
INTERACTION(CUSTOMER_ID)

(For the first two indexes, the first column is more important than the other two, unless the number of matches is very large.)

Also, change the UNION to UNION ALL. UNION incurs overhead for removing duplicates -- and this is not possible (at least between the two subqueries) because each subquery returns different columns.

In addition, you might want to replace the first subquery with a full outer join:

SELECT COALESCE(a.PERSON_ID, p.PERSON_ID) as PERSON_ID,
       a.LAST_CLICK_DATE, a.LAST_OPEN_DATE,p.LAST_PURCHASE_DATE
FROM (SELECT ACT.PERSON_ID AS PERSON_ID,
             MAX(ACT.LAST_CLICK_DATE) AS LAST_CLICK_DATE,
             MAX(ACT.LAST_OPEN_DATE) AS LAST_OPEN_DATE
      FROM ACTIVITY_INTER ACT
      WHERE ACT.JOB_ID = 77318317
      GROUP BY ACT.PERSON_ID
     ) a FULL OUTER JOIN
     (SELECT PUR.PERSON_ID AS PERSON_ID,
             MAX(PUR.LAST_PURCHASE_DATE) AS LAST_PURCHASE_DATE
      FROM PURCHASE_INTER PUR
      WHERE PUR.JOB_ID = 77318317
      GROUP BY PER.PERSON_ID
     ) p
     ON a.PERSON_ID = p.PERSON_ID

This gives Oracle more options for optimization, because the aggregation is done directly on the tables -- making indexes and better statistics available for the processing.

Thanks all for your help on this. It was really helpful. I learned many new things. — Sammy Pawar, Apr 01 '16 at 09:17

Improve the performance

1 Answers1