MYSQL sum() for distinct rows

Question

I'm looking for help using sum() in my SQL query:

SELECT links.id, 
       count(DISTINCT stats.id) as clicks, 
       count(DISTINCT conversions.id) as conversions, 
       sum(conversions.value) as conversion_value 
FROM links 
LEFT OUTER JOIN stats ON links.id = stats.parent_id 
LEFT OUTER JOIN conversions ON links.id = conversions.link_id 
GROUP BY links.id 
ORDER BY links.created desc;

I use DISTINCT because I'm doing "group by" and this ensures the same row is not counted more than once.

The problem is that SUM(conversions.value) counts the "value" for each row more than once (due to the group by)

I basically want to do SUM(conversions.value) for each DISTINCT conversions.id.

Is that possible?

Posting the full query would be helpful. How do you duplicate values with a `GROUP BY`? — Matthew, Mar 12 '10 at 22:31
Are you doing a Join? You should post you query. There a few options you can use depending on the query. — Michael D. Irizarry, Mar 12 '10 at 22:54

score 94 · Accepted Answer · answered Mar 13 '10 at 21:17

94

I may be wrong but from what I understand

conversions.id is the primary key of your table conversions
stats.id is the primary key of your table stats

Thus for each conversions.id you have at most one links.id impacted.

You request is a bit like doing the cartesian product of 2 sets :

[clicks]
SELECT *
FROM links 
LEFT OUTER JOIN stats ON links.id = stats.parent_id 

[conversions]
SELECT *
FROM links 
LEFT OUTER JOIN conversions ON links.id = conversions.link_id

and for each link, you get sizeof([clicks]) x sizeof([conversions]) lines

As you noted the number of unique conversions in your request can be obtained via a

count(distinct conversions.id) = sizeof([conversions])

this distinct manages to remove all the [clicks] lines in the cartesian product

but clearly

sum(conversions.value) = sum([conversions].value) * sizeof([clicks])

In your case, since

count(*) = sizeof([clicks]) x sizeof([conversions])
count(*) = sizeof([clicks]) x count(distinct conversions.id)

you have

sizeof([clicks]) = count(*)/count(distinct conversions.id)

so I would test your request with

SELECT links.id, 
   count(DISTINCT stats.id) as clicks, 
   count(DISTINCT conversions.id) as conversions, 
   sum(conversions.value)*count(DISTINCT conversions.id)/count(*) as conversion_value 
FROM links 
LEFT OUTER JOIN stats ON links.id = stats.parent_id 
LEFT OUTER JOIN conversions ON links.id = conversions.link_id 
GROUP BY links.id 
ORDER BY links.created desc;

Keep me posted ! Jerome

answered Mar 13 '10 at 21:17

Jerome WAGNER

21,986
8
62
77

1

Great, this solution is perfect and quite universal when you don't want to deal with dependent subqueries solution which is not acceptable for large data sets. – Luke Adamczewski Mar 01 '13 at 15:28
1

Jeromes solution is actually wrong and can produce incorrect results!! See my answer below. – Clemens Valiente Mar 14 '14 at 15:32
4

@ClemensValiente, Jerome's solution is correct, *given that conversions.id is a unique column on the conversions table*. This is probably an important distinction to make and should be noted in the answer. EDIT -- actually, it is stated (conversions.id is the primary key of your table conversions) – Jonathan Mar 14 '14 at 18:10
And what way is better, subquery or Jeromes solution – yozzz Mar 11 '15 at 12:38
This is genius, amazing! it solves so many otherwise unfeasible queries! – Timo Huovinen May 26 '16 at 11:58
@yozzz a subquery will need a group by, the group by will trigger a temp table creation, the temp table will be generated for all the data every time the query is executed, this is really slow. Jerome's solution fixes that and should be much faster overall for most cases. – Timo Huovinen May 26 '16 at 11:59
long story short: divide original table sum on joined table(s) distinct row count, forgot and googled almost 2 times – BIOHAZARD Feb 01 '17 at 15:24
4

This is not a sufficient answer because it is dividing by the total number of rows in the join and this will give unwanted results if the parent table has multiple relationships in the joined table. – kjdion84 Aug 25 '17 at 00:07
I agree that you, sir, are a freaking genius. This is elegant, both mathematically and performance-wise and goes on top of my snippet list! – Nico R Feb 06 '19 at 12:20
Incredible man, I have been working with databases for years and never seen this before :) Congratulations! – Mickle Foretic Mar 24 '20 at 16:21
CASE WHEN can help if you sometimes have zero entries in the joined table. – Alexis R Jun 07 '21 at 17:02

score 14 · Answer 2 · answered Aug 08 '13 at 07:44

14

Jeromes solution is actually wrong and can produce incorrect results!!

sum(conversions.value)*count(DISTINCT conversions.id)/count(*) as conversion_value

let's assume the following table

conversions
id value
1 5
1 5
1 5
2 2
3 1

the correct sum of value for distinct ids would be 8. Jerome's formula produces:

sum(conversions.value) = 18
count(distinct conversions.id) = 3
count(*) = 5
18*3/5 = 9.6 != 8

answered Aug 08 '13 at 07:44

Clemens Valiente

829
1
8
16

5

Assuming that conversions.id is a unique field, there is no way that a JOIN could produce 3 rows with conversions.id = 1 and only 1 row where conversions.id = 2. The assumption that conversions.id is unique is implicit, and should probably be made explicit, but other than that, the formula is solid. – Jonathan Mar 14 '14 at 18:12
2

Is it `9.6` or `10.8`. by the way is there any solution other than multi-level sub query or join(sub query)? – James Dec 11 '19 at 10:46
@Jonathan You're right given the OP's exact query. Clemens is correct if this scenario were within a larger query with other joins causing multiple rows of the same conversions.id. Unfortunately that's the situation I'm confronting right now. – Rikaelus Aug 26 '21 at 21:12
@Rikaelus fair point. I guess another implicit assumption I hadn't thought of is the exact JOINs used. In the OP query, the GROUP BY is on a primary key `links.id`, there is a joined table/field on that `conversions.value` and there are *no other joins off of the conversions table*. This last clause is super important If the scenario you're in involves more joins off of the `links` table directly, you *might* be able to do a similar trick of calculating the sum a bit backwards. Otherwise, you might have to back off and to a dependent subquery or just multiple SQL queries. Good luck! – Jonathan Aug 28 '21 at 14:13

TehShrike · Answer 3 · 2014-03-07T15:19:33.157

9

For an explanation of why you were seeing incorrect numbers, read this.

I think that Jerome has a handle on what is causing your error. Bryson's query would work, though having that subquery in the SELECT could be inefficient.

edited Mar 07 '14 at 15:19

answered Mar 20 '10 at 17:48

TehShrike

9,855
2
33
28

1

Thanks for this answer! I recommend everyone arriving on this page to read the linked article. It's a concise, condescending and comic explanation of joins and grouping, and offers a _proper_ solution to the issue. – Blaise Jan 19 '21 at 14:15

score 6 · Answer 4 · answered Mar 18 '10 at 21:46

Use the following query:

SELECT links.id
  , (
    SELECT COUNT(*)
    FROM stats
    WHERE links.id = stats.parent_id
  ) AS clicks
  , conversions.conversions
  , conversions.conversion_value
FROM links
LEFT JOIN (
  SELECT link_id
    , COUNT(id) AS conversions
    , SUM(conversions.value) AS conversion_value
  FROM conversions
  GROUP BY link_id
) AS conversions ON links.id = conversions.link_id
ORDER BY links.created DESC

score 3 · Answer 5 · answered Mar 12 '10 at 22:49

3

I use a subquery to do this. It eliminates the problems with grouping. So the query would be something like:

SELECT COUNT(DISTINCT conversions.id)
...
     (SELECT SUM(conversions.value) FROM ....) AS Vals

answered Mar 12 '10 at 22:49

Dave

1,234
13
24

Updated question with my full query. I'm not sure how I'd integrate a subquery into what I have and how it would affect performance. – makeee Mar 12 '10 at 23:06
Subqueries normally impact performance negatively. To minimize the impact make sure any subquery is acting on an index. – Dave Mar 15 '10 at 17:17

score 2 · Answer 6 · edited Jul 04 '22 at 17:06

How about something like this:

select l.id, count(s.id) clicks, count(c.id) clicks, sum(c.value) conversion_value
from    (SELECT l.id id, l.created created,
               s.id clicks,  
               c.id conversions,  
               max(c.value) conversion_value                    
        FROM links l
        LEFT JOIN stats s ON l.id = s.parent_id
        LEFT JOIN conversions c ON l.id = c.link_id  
        GROUP BY l.id, l.created, s.id, c.id) t
order by t.created

score 1 · Answer 7 · answered Oct 14 '13 at 20:06

This will do the trick, just divide the sum with the count of conversation id which are duplicate.

SELECT a.id,
       a.clicks,
       SUM(a.conversion_value/a.conversions) AS conversion_value,
       a.conversions
FROM (SELECT links.id, 
       COUNT(DISTINCT stats.id) AS clicks, 
       COUNT(conversions.id) AS conversions, 
       SUM(conversions.value) AS conversion_value 
      FROM links 
      LEFT OUTER JOIN stats ON links.id = stats.parent_id 
      LEFT OUTER JOIN conversions ON links.id = conversions.link_id 
      GROUP BY conversions.id,links.id
      ORDER BY links.created DESC) AS a
GROUP BY a.id

score 0 · Answer 8 · answered Aug 05 '20 at 07:54

Select sum(x.value) as conversion_value,count(x.clicks),count(x.conversions)
FROM
(SELECT links.id, 
       count(DISTINCT stats.id) as clicks, 
       count(DISTINCT conversions.id) as conversions,
       conversions.value,       
FROM links 
LEFT OUTER JOIN stats ON links.id = stats.parent_id 
LEFT OUTER JOIN conversions ON links.id = conversions.link_id 
GROUP BY conversions.id) x
GROUP BY x.id 
ORDER BY x.created desc;

I believe this will give you the answer that you are looking for.

MYSQL sum() for distinct rows

8 Answers8

Linked

Related