1

Good morning. I'm working in Responsys Interact, which is an Oracle-based email campaign management type SAAS product. I'm creating a query to basically filter a target list for an email campaign designed to target a specific sub-set of our master email contact list. Here's the query I created a few weeks ago that appears to work:

/*

 Table              Symbolic Name   
 CONTACTS_LIST          $A$   
 Engaged                $B$   
 TRANSACTIONS_RAW       $C$   
 TRANSACTION_LINES_RAW  $D$  

-- A Responsys Filter (Engaged) will return only an RIID_, nothing else, according to John @ Responsys....so,....let's join on that to contact list...

*/
SELECT     
DISTINCT $A$.EMAIL_ADDRESS_,   
$A$.RIID_,     
$A$.FIRST_NAME,      
$A$.LAST_NAME,     
$A$.EMAIL_PERMISSION_STATUS_         

FROM   
$A$      
JOIN   $B$  ON  $B$.RIID_ = $A$.RIID_             
LEFT JOIN   $C$  ON  $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_       
LEFT JOIN   $D$  ON  $D$.TRANSACTION_ID = $C$.TRANSACTION_ID         

WHERE   
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com')  AND   
/* don't include hp customers */   
$A$.HP_PLAN_START_DATE IS NULL   AND    
$A$.EMAIL_ADDRESS_ NOT IN 
(
    SELECT   
    $C$.EMAIL_ADDRESS_      

    FROM 
    $C$      
    JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID      

    WHERE 
    /* Get only purchase transactions for certain item_id's/SKU's */
    ($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND    
    /* .... within last 60 days (i.e. 2 months) */  
    $A$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2) 
)
;

This seems to work, in that if I run the query without the sub-query, we get 720K rows; and if I add back the 'AND NOT IN...' subquery, we get about 700K rows, which appears correct based on what my user knows about her data. What I'm (supposedly) doing with the NOT IN subquery is filtering out any email addresses where the customer has purchased certain items from us in the last 60 days.

So, now I need to add in another constraint. We still don't want customers who made certain purchases in the last 60 days as above, but now also we want to exclude customers who have purchased another particular item, but now within the last 12 months. So, I thought I would add another subquery, as shown below. Now, this has introduced several problems:

  1. Performance - the query, which took a couple minutes to run before, now takes quite a few more minutes to run - in fact it seems to time out....
  2. So, I wondered if there's an issue having two subqueries, but before I went to think about alternatives to this, I decided to test my new subquery by temporarily deleting the first subquery, so that I had just one subquery similar to above, but with the new item = 11 and within the last 12 months logic. And so with this, the query finally returned after a few minutes now, but with zero rows.
  3. Trying to figure out why, I tried simply changing the AND NOT IN (subquery) to AND IN (subquery), and that worked, in that it returned a few thousand rows, as expected.
  4. So why would the same SQL when using AND IN (subquery) "work", but the exact same SQL simply changed to AND NOT IN (subquery) return zero rows, instead of what I would expect which would be my 700 something thousdand plus rows, less the couple thousand encapsulated by the subquery result?
  5. Also, what is the best i.e. most performant way to accomplish what I'm trying to do, which is filter by some purchases made within one date range, AND by some other purchases made within a different date range?

Here's the modified version:

SELECT    
DISTINCT $A$.EMAIL_ADDRESS_,  
$A$.RIID_,    
$A$.FIRST_NAME,     
$A$.LAST_NAME,    
$A$.EMAIL_PERMISSION_STATUS_        

FROM  
$A$     
JOIN   $B$  ON  $B$.RIID_ = $A$.RIID_            
LEFT JOIN   $C$  ON  $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_      
LEFT JOIN   $D$  ON  $D$.TRANSACTION_ID = $C$.TRANSACTION_ID        

WHERE  
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com')  AND  
/* don't include hp customers */  
$A$.HP_PLAN_START_DATE IS NULL   AND   
$A$.EMAIL_ADDRESS_ NOT IN
(
   SELECT  
   $C$.EMAIL_ADDRESS_     

   FROM
   $C$     
   JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID     

   WHERE
   /* Get only purchase transactions for certain item_id's/SKU's */   
   ($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND   
   /* .... within last 60 days (i.e. 2 months) */   
   $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)    
) 
AND 
$A$.EMAIL_ADDRESS_ NOT IN 
(
    /* get purchase transactions for another type of item within last year */
    SELECT 
    $C$.EMAIL_ADDRESS_
    FROM
    $C$     
    JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
    WHERE
    $D$.ITEM_FAMILY_ID = 11 AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
)
;

Thanks for any ideas/insights. I may be missing or mis-remembering some basic SQL concept here - if so please help me out! Also, Responsys Interact runs on top of Oracle - it's an Oracle product - but I don't know off hand what version/flavor. Thanks!

JMoll
  • 11
  • 3
  • You really need to look at the execution plans for the various queries to see what join orders, indexes and filters are being used. Which might be tricky through the middle layer, unless you can run the queries against the underlying tables. I'd guess `trandate` is not selective enough with the 12-month range, and either `item_id` or `item_family_id` isn't indexed (are you using the right one in the new filter?). But... you're already left joining to `$D$`, so do you need to join again in either subquery? – Alex Poole Oct 30 '14 at 16:51
  • Thanks for the reply, Alex. Your point about trandate not being selective enough within the 12 month range was actually my first thought when I first ran the modified query. But knowing the data, I was pretty sure we've had purchases for this item in the last 12 months. So, to test that, I got rid of the first original subquery to focus on just the new subquery, and changed the 'NOT IN' to 'IN' to see if I would really get zero customers purchasing that item in the last 12 months.... – JMoll Oct 30 '14 at 17:03
  • ...(continued from comment above) This returned a total of 2683 rows, along the lines of what was expected. But then when I changed it back from 'IN' to 'NOT IN', I got zero rows. I would have expected to get the 700k minus the 2683.... so there's something I'm not seeing correctly in all this.... – JMoll Oct 30 '14 at 17:03
  • I meant selective in the sense that there's too much data in that date range for an index on that column to be effective; and if `item_family_id` isn't indexed then it might lead to a full table scan. (I'm still not sure if you meant family 11 or item 11.) How many rows does the subquery return on its own? If the `in` is finding 2683 rows, you'd expect the `not in` to find 1.4m maybe - is the query completing successfully with zero rows, or maybe failing and you're not seeing the exception for some reason? Failing with a memory/temp space issue rather than a syntax error, for example. – Alex Poole Oct 30 '14 at 17:37
  • Thanks Alex. Ok, I understand what you mean now by 'selective'. Responsys does have an explain option of sorts, so I looked at that, and then went and built some indexes. Re-running the explain now gives me no warnings, but still takes a very long time to run, but now returns over 700K rows... you're prob. right that the NOT IN prior to adding indexes was not really returning 0 records, it must have just been timing/erroring out... Thx for the help! – JMoll Oct 30 '14 at 18:45

2 Answers2

0

Looks like my problem with the new subquery was due to poor performance due to lack of indexes. Thanks to Alex Poole's comments, I looked in Responsys and there is a facility to get an 'explain' type analysis, and it was throwing warnings, and suggesting I build some indexes. Found the way to do that on the data sources, went back to the explain, and it said, "The query should run without placing an unnecessary burden on the system". And while it still ran for quite a few minutes, it did finally come back with close to the expected number of rows.

Now, I'm on to tackle the other half of the issue, which is to now incorporate this second sub-query in addition to the first, original subquery....

JMoll
  • 11
  • 3
0

Ok, upon further testing/analysis and refining my stackoverflow search critieria, the answer to the main part of my question dealing with the IN vs. NOT IN can be found here: SQL "select where not in subquery" returns no results

My performance was helped by using Responsys's explain-like feature and adding some indexes, but when I did that, I also happened to add in a little extra SQL in my sub-query's WHERE clause.... when I removed that, even after indexes built, I was back to zero rows returned. That's because as it turned out at least one of the transactions rows for the item family id I was interested in for this additional sub-query had a null value for email address. And as further explained in the link above, when using NOT IN, as soon as you have a null value involved, SQL can't definitively say it's NOT IN, since you can't really compare to null, so as soon as you have a null, the sub-query's going to evaluate 'false', thus zero rows. When using IN, even though there are nulls present, if you get one positive match, well, that's a match, so the sub-query returns 'true', so that's why you'll get rows with IN, but not with NOT IN. I hadn't realized that some of our transaction data may have null email addresses - now I know, so I just added a where not null to the where clause for the email address, and now all's good.

Community
  • 1
  • 1
JMoll
  • 11
  • 3