3

I'm creating a application that will generate lists for email marketing campaigns. I have tables for contacts, emails, and campaigns. A campaign has many emails and a contact has many emails. The email is related to a contact and a campaign. Basically a table for a MANY to MANY relationship except I have other fields in the table for the result of the email (clicked, opened, unsubscribed, etc). There are also other tables but this is where I'm having the trouble.

I'm trying to use NOT IN with a subquery to get a list of contacts who have not received an email since a certain date with other conditions. An example query is this:

SELECT * 
FROM `contact` `t` 
WHERE (unsubscribed='1')
  AND t.id NOT IN 
   (SELECT distinct contact_id 
    FROM email, campaign 
    WHERE email.campaign_id = campaign.id 
      AND campaign.date_sent >= '2012-07-12') 
ORDER BY rand() 
LIMIT 10000

This returns 0 result. However, if I run the first condition:

select id 
from contact 
where unsubscribed=1

I have 9075 rows. Then, if I separately run the subquery:

SELECT distinct contact_id 
FROM email, campaign 
WHERE email.campaign_id = campaign.id 
  AND campaign.date_sent >= '2012-07-12'

I have 116612 rows. Out of each of those results, I end up with 826 values that are duplicates. From what I can understand, this means that 9075-826=8249 records ARE unsubscribed=1 AND NOT IN the second query. So, my first query should be returning 8249 results but it is returning 0. I must be structuring the query wrong or using the wrong operators but I can not for the life of me figure out how to get this right.

Can anyone help? So many thanks in advance as this has had me stumped for like 3 days! :)

hcarver
  • 7,126
  • 4
  • 41
  • 67
BVBAccelerate
  • 172
  • 1
  • 5
  • 17
  • how did you check this: 826 values that are duplicates? – Razvan Aug 02 '12 at 16:04
  • Please give the result of `SELECT * FROM email, campaign WHERE email.campaign_id = campaign.id AND campaign.date_sent >= '2012-07-12' AND contact_id IS NULL` – Sebas Aug 02 '12 at 16:14
  • 1
    "From what I can understand, this means that 9075-826=8249 records ARE unsubscribed=1 AND NOT IN the second query." Uhm, no. It means you have 116612 - 826 = 115786 unique contact_id's that you don't want to include. Whether or not some or all of the 9075 contacts are or aren't in that set, is another question. – Roland Bouman Aug 02 '12 at 16:24
  • @ Razvam - I exported the results into Excel and removed duplicates and this is the number it gave me. – BVBAccelerate Aug 02 '12 at 17:38
  • @Roland, I'm not sure if I'm understanding you. The result of the first condition gives me 9075 unsubscribed people. From that I want to get all those people who have not received an email since July 12th, so I shouldn't be able to get more than 9075 results. If I'm incorrect in how I'm understanding it then I need to rewrite the query. Either way, my query is returning 0 results which goes against what we both are thinking! Thanks for all the help so far! – BVBAccelerate Aug 02 '12 at 17:48
  • "so I shouldn't be able to get more than 9075 results." agreed. "Either way, my query is returning 0 results which goes against what we both are thinking!" < uhm, no. 0 is less than 9075, so it could certainly be correct. Your first condition is applied to contacts, the second to campaign and email. How can you be so sure the result should not in fact be exactly 0? – Roland Bouman Aug 02 '12 at 19:06
  • Good point Roland. I've gone and looked in the database to make sure of this. I'll run the first part of the query to get people who have unsubscribed. id 1 is in this list. Then, I'll run the query to get distinct contact_id's of everyone who has received an email since July 12th. id 1 is not in this list. So, since id 1 is in the unsubscribed result and NOT IN the result for having received an email since July 12th, I figure that this contact should show up in the result. – BVBAccelerate Aug 02 '12 at 19:42
  • Okay, so I'm about to lose my mind. I just copied and pasted that exact query that I had above into MySQL Workbench, ran it, and got 8422 rows. Is there any way that a certain load on my server or something could have caused me to receive a bunk result like that? Basically it just started magically working and I can't explain this because it's the exact same query on the exact same data. – BVBAccelerate Aug 02 '12 at 20:01
  • @BVBAccelerate, "Is there any way that a certain load on my server or something could have caused me to receive a bunk result like that?" well, in a philosophical sense, nothing is impossible :p. However it seems so unlikely that I have no hesitation to immediately dismiss that thought :) Most likely, you just had some mistake in your query – Roland Bouman Aug 03 '12 at 06:10

3 Answers3

10

This is because

SELECT 1 FROM DUAL WHERE 1 NOT IN (NULL, 2) 

won't return anything, whereas

SELECT 1 FROM DUAL WHERE 1 NOT IN (2)

will.

Please review the behaviour of NOT IN and NULL in MYSQL.

For your concern you should get away with it using NOT EXISTS instead of NOT IN:

SELECT * FROM `contact` `t` 
WHERE (unsubscribed='1')
AND NOT EXISTS (
    SELECT * FROM email, campaign 
    WHERE 
        email.campaign_id = campaign.id 
    AND campaign.date_sent >= '2012-07-12'
    AND t.id = contact_id
) 
ORDER BY rand() 
LIMIT 10000
Sebas
  • 21,192
  • 9
  • 55
  • 109
  • Hey Sebas, I've tried using NOT EXISTS in the query already and it hasn't worked for me. This is what I tried: SELECT * FROM `contact` `t` WHERE (1=1) AND (unsubscribed='1') AND NOT EXISTS (SELECT distinct contact_id FROM email, campaign WHERE email.campaign_id = campaign.id AND campaign.date_sent >= '2012-07-01') ORDER BY rand() LIMIT 10000 . It gave me the same result as using NOT IN above. – BVBAccelerate Aug 02 '12 at 19:53
  • Hi @BVBAccelerate, I added the query. Regards. – Sebas Aug 02 '12 at 23:13
  • It did work with this query but was only returning 36 results for me which I did not expect. I'll look into some expected results and compare them against what I'm getting and if it is accurate I will let you know. – BVBAccelerate Aug 06 '12 at 17:12
5

Just wasted a few hours and a few hairs on this as well.

Was not able to get "NOT exists" to work as the accepted answer mentioned. However, you can simply throw in a NOT NULL in

WHERE "field you are aggregating is not NULL", and that did the job.

SELECT * 
FROM `contact` `t` 
WHERE (unsubscribed='1')
  AND t.id NOT IN 
   (SELECT distinct contact_id 
    FROM email, campaign 
    WHERE email.campaign_id = campaign.id 
      AND campaign.date_sent >= '2012-07-12'
      AND contact_id is not NULL          ###*************added line
) 
ORDER BY rand() 
LIMIT 10000
FlyingZebra1
  • 1,285
  • 1
  • 18
  • 28
0
select c.*, e.id from contact as c 
left join email as e on c.id = e.contact_id and e.date_sent >= '2012-07-12' 
where e.id is null and c.unsubscribed = 1

I think campaign.date_sent was a typo? It must be email.date_sent?

  • why did you leave out the join between email and campaign? looks like a pretty essential bit. – Roland Bouman Aug 02 '12 at 16:25
  • All of the emails for a campaign are sent out on a date, and that date is date_sent from the campaign table. Rather than putting that date_sent on each email it is on the campaign. – BVBAccelerate Aug 02 '12 at 17:39