4

I have the following mysql table called "pics", with the following fields and sample data:

id   vehicle_id    filename    priority
1    45            a.jpg       4
2    45            b.jpg       1
3    56            f.jpg       4
4    67            cc.jpg      4
5    45            kt.jpg      3
6    67            gg.jpg      1

Is it possible, in a single query, to get one row for each vehicle_id, and the row be the highest priority?

The result I'm looking for:

array (
  [0] => array( [id] => '2', [vehicle_id] => '45', [filename] => 'b.jpg',  [priority] => '1' ),
  [1] => array( [id] => '3', [vehicle_id] => '56', [filename] => 'f.jpg',  [priority] => '4' ),
  [2] => array( [id] => '6', [vehicle_id] => '67', [filename] => 'gg.jpg', [priority] => '1' )
);

If not possible in a single query, what would be the best approach?

Thanks!

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
briang
  • 43
  • 1
  • 4
  • What if a vehicle_id has 2 rows tied for highest. Do you want both back? – Martin Smith Jun 22 '10 at 21:39
  • No, then i would default to the earliest posted. Only 1 row per vehicle_id. – briang Jun 22 '10 at 21:52
  • @briang: How do you know which is the earliest posted? The one with the lowest id? – Mark Byers Jun 22 '10 at 21:54
  • Yes, lowest id. I also have a date_posted field. But I left it out for this example to try and keep it simple. – briang Jun 22 '10 at 22:03
  • @briang: How many rows and vehicle_ids are there in your table? How important is performance? – Mark Byers Jun 22 '10 at 22:28
  • @Mark: There will be thousands+ eventually. Performance is important. – briang Jun 22 '10 at 22:33
  • @briang: "thousands+" doesn't say much. An upper limit is more useful than a lower limit. Could there be millions of rows? Hundreds of millions? How many vehicle_ids will there be compared to the number of rows in the table? Will there be many vehicle_ids with a small number of rows in each, or just a few vehicles with many rows for each vehicle. The cardinality can be important in deciding which query is fastest. For just a few thousand rows performance is probably not an issue - no matter how you write the query it will be nearly instant, even without indexes. Will you profile performance? – Mark Byers Jun 22 '10 at 22:42
  • @Mark: Yes, potentially millions of rows. I'd LIKE to see hundreds of millions, all depends on the success of the site. But I would like to build for that possibility. – briang Jun 22 '10 at 22:50

3 Answers3

5

This seems to be a typical max per group query. In most databases you could easily do this using ROW_NUMBER:

SELECT id, vehicle_id, filename, priority
FROM (
    SELECT 
        id, vehicle_id, filename, priority,
        ROW_NUMBER() OVER (PARTITION BY vehicle_id
                           ORDER BY priority DESC, id) AS rn
    FROM pics
) AS T1
WHERE rn = 1

Since MySQL doesn't yet support ROW_NUMBER you can emulate it using variables:

SELECT id, vehicle_id, filename, priority
FROM (
    SELECT
        id, vehicle_id, filename, priority,
        @rn := CASE WHEN @prev_vehicle_id = vehicle_id
                    THEN @rn + 1
                    ELSE 1
               END AS rn,
        @prev_vehicle_id := vehicle_id
    FROM (SELECT @prev_vehicle_id := NULL) vars, pics T1
    ORDER BY vehicle_id, priority DESC, id
) T2
WHERE rn = 1
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • For some reason, people find variables scary in MySQL indeed, I guess it's the unfamiliarity with the '@'s and ':=' compared to ANSI SQL which scare them off. It shouldn't of course, variables can be a major boost to performance. – Wrikken Jun 22 '10 at 23:20
5

Note that this method was posted before it was made clear that it is required to handle priority ties. I'm leaving it here for reference (see comments below). Check out @Mark's answer for a solution that handles ties as required:

SELECT p.id, p.vehicle_id, p.filename, p.priority
FROM   pics p
JOIN   (
           SELECT   vehicle_id, MAX(priority) max_priority
           FROM     pics
           GROUP BY vehicle_id
       ) sub_p ON (sub_p.vehicle_id = p.vehicle_id AND 
                   sub_p.max_priority = p.priority)
GROUP BY p.vehicle_id;

This assumes that there can be no priority ties for the same vehicle_id.

Test case:

CREATE TABLE pics (id int, vehicle_id int, filename varchar(10), priority int);

INSERT INTO pics VALUES ('1', '45', 'a.jpg', '4');
INSERT INTO pics VALUES ('2', '45', 'b.jpg', '1');
INSERT INTO pics VALUES ('3', '56', 'f.jpg', '4');
INSERT INTO pics VALUES ('4', '67', 'cc.jpg', '4');
INSERT INTO pics VALUES ('5', '45', 'kt.jpg', '3');
INSERT INTO pics VALUES ('6', '67', 'gg.jpg', '1');

Result:

+------+------------+----------+----------+
| id   | vehicle_id | filename | priority |
+------+------------+----------+----------+
|    1 |         45 | a.jpg    |        4 |
|    3 |         56 | f.jpg    |        4 |
|    4 |         67 | cc.jpg   |        4 |
+------+------------+----------+----------+
3 rows in set (0.01 sec)
Community
  • 1
  • 1
Daniel Vassallo
  • 337,827
  • 72
  • 505
  • 443
  • This only works if they need a single record, min/max at that. This won't work if they want top 5/10/etc per. – OMG Ponies Jun 22 '10 at 21:45
  • @OMG: That's true. But I think that's the only disadvantage, isn't it? ... The OP doesn't seem to require that: "the row be the highest priority" – Daniel Vassallo Jun 22 '10 at 21:47
  • He did say he explicitly that he only wanted one row so I don't think that's a serious problem. A more serious problem is that this can return more than one row, though it is fixable. – Mark Byers Jun 22 '10 at 21:49
  • 2
    Never know what someone's requirements are, never too late for a bait'n'switch. – OMG Ponies Jun 22 '10 at 21:49
  • 1
    PS: Check the question again - he just posted his requirements for a tie break - "first posted". Looks like you will need that fix after all... – Mark Byers Jun 22 '10 at 21:57
  • @Mark: Yes, I updated my answer so that it is clear that this assumes that there can be no priority ties... I tend to avoid using variables in MySQL unless strictly necessary. However I'm not sure if this "tendency" is justified. If the case won't have priority ties, would you still go for the variables approach? (just curious) – Daniel Vassallo Jun 22 '10 at 21:58
  • 1
    @Daniel Vassallo: This approach should give higher performance than the variable approach, so that's one reason to go with this over the method I suggested. But this method is harder to get right. The number of columns involved in the tiebreaking changes the structure of the query. To modify my answer to include the extra column was trivial. So I think both answers have pros and cons. Shame that there is no obvious best answer to this straightforward question. – Mark Byers Jun 22 '10 at 22:07
  • @Mark: That's perfectly true. It's probably more complicated than it's worth to handle tie rules correctly. – Daniel Vassallo Jun 22 '10 at 22:14
5

While this may be the 'accepted' answer, the performance of Mark's solution is under normal circumstances many times better, and equally valid for the question, so by all means, go for his solution in production!


SELECT a.id, a.vehicle_id, a.filename, a.priority
FROM pics a
LEFT JOIN pics b               -- JOIN for priority
ON b.vehicle_id = a.vehicle_id 
AND b.priority > a.priority
LEFT JOIN pics c               -- JOIN for priority ties
ON c.vehicle_id = a.vehicle_id 
AND c.priority = a.priority 
AND c.id < a.id
WHERE b.id IS NULL AND c.id IS NULL

Assuming 'id' is a non-nullable column.

[edit]: my bad, need second join, cannot do it with just one.

Community
  • 1
  • 1
Wrikken
  • 69,272
  • 8
  • 97
  • 136
  • 1
    Won't this be far too slow? Have you tested the performance of it? In what way are the other answers not 'a single query'? – Mark Byers Jun 22 '10 at 23:08
  • @Mark: I am a bit unfamiliar with actual performance testing. I only assumed that a single query would be faster than multiple. I would like to test these out. Is there a particular resource I could test these options with? Thanks for the help! – briang Jun 22 '10 at 23:13
  • The 'not a single query' is indeed not very to the point imo, as all answers here are. This won't be slow with a proper `INDEX(vehicle_id,priority)`, MySQL recognizes the 'LEFT JOIN with no results'-construct with quite an optimised query plain in my experience. Difference may vary depending on the amount of records per vehicle ID. If I had to guess (and the table was properly indexed), I'd say Mark Byers' answer probably performs better with a low amount of records per vehicle_id, while this one can take more advantage of the index on high amounts of records per vehicle_id. – Wrikken Jun 22 '10 at 23:16
  • That is: compared to themselves. Comparing the performance of both those answers in the real life situation is what should be done. – Wrikken Jun 22 '10 at 23:17
  • @Mark: Sorry, missed your trailing question "In what way are the other answers not 'a single query'?". Isn't a "sub-query" an additional query? – briang Jun 22 '10 at 23:21
  • @Wrikken: Even with the index you suggested for 10000 records in pics (100 vehicle_ids, 7 priority levels) I time this at 1.3 seconds. My solution takes 0.09s for the same data. I can post scripts to generate the test data I used if you wish. I wish something this simple was fast... maybe I'm missing something, but I'm very skeptical that this method will be as fast as the other two solutions. – Mark Byers Jun 22 '10 at 23:31
  • @Mark: will run some tests tomorrow, I'm interested too, and how it may or may not alter with the amount of records per vehicle id, and w. If you already have the script in place to generate the testdata I'd be grateful if you could pastebin it somewhere, seems a waste to write my own :) Of course, I will destroy the (small) advantage of probably having the records already in order in the table just to make a point, but I don't imagine that will influence the result considerably. – Wrikken Jun 22 '10 at 23:51
  • @Mark: sorry for the delay, have been busy: a.t.m, with setting the most biased environment I could think of for my join solution, it is still about 3 times as slow as yours (version 5.1.47). Will keep your solution in mind. – Wrikken Jul 07 '10 at 13:18
  • Very useful. I knew the self-join strategy aimed at "Not Exists" optimization, but I couldn't think of double self-join. Thank you very much. – mpyw Nov 06 '18 at 05:32