51

I have data that looks like this:

entities
id         name
1          Apple
2          Orange
3          Banana

Periodically, a process will run and give a score to each entity. The process generates the data and adds it to a scores table like so:

scores 
id  entity_id    score   date_added
1    1            10       1/2/09
2    2            10       1/2/09
3    1            15       1/3/09
4    2            10       1/03/09
5    1            15       1/4/09
6    2            15       1/4/09
7    3            22       1/4/09

I want to be able to select all of the entities along with the most recent recorded score for each resulting in some data like this:

entities
id name     score  date_added
1  Apple     15     1/4/09
2  Orange    15     1/4/09
3  Banana    15     1/4/09

I can get the data for a single entity using this query:

SELECT entities.*, 
       scores.score, 
       scores.date_added 
FROM entities

INNER  JOIN scores
ON entities.id = scores.entity_id

WHERE entities.id = ?

ORDER BY scores.date_added DESC
LIMIT 1

But I'm at a loss for how to select the same for all entities. Perhaps it's staring me in the face?

Thank you very kindly for taking the time.

Thanks for the great responses. I'll give it a few days to see if a preferred solution bubbles up then I'll select the answer.

UPDATE: I've tried out several of the proposed solutions, the main issue I'm facing now is that if an entity does not yet have a generated score they don't appear in the list.

What would the SQL look like to ensure that all entities are returned, even if they don't have any score posted yet?

UPDATE: Answer selected. Thanks everyone!

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
GloryFish
  • 13,078
  • 16
  • 53
  • 43

7 Answers7

63

I do it this way:

SELECT e.*, s1.score, s1.date_added 
FROM entities e
  INNER JOIN scores s1
    ON (e.id = s1.entity_id)
  LEFT OUTER JOIN scores s2
    ON (e.id = s2.entity_id AND s1.id < s2.id)
WHERE s2.id IS NULL;
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 1
    Thanks Bill, I ended up settling on this solution but swapped out the INNER JOIN for a LEFT JOIN to include entities that don't have scores yet. – GloryFish Feb 12 '09 at 22:40
  • 1
    I like this solution, i'm using the LEFT JOIN also. How would you recommend dealing with a tie, in the case that there were two scores for the same entity with the same date? – russds Nov 07 '12 at 20:58
  • I was trying to setup a query in a join and based on a date field so that I would get the result on the most recent record, but not join on a record with a future date. I had to do a sub-query in both the inner join and the left outer join selecting records where my date field was <= GETDATE(). That worked for me. – user1408767 Feb 13 '15 at 14:52
  • @BillKarwin, What is the purpose of `WHERE s2.id IS NULL` ? – dev1998 May 18 '16 at 21:37
  • 3
    @dev1998, s2.id will be NULL only where the OUTER JOIN found no row in s2 that matched the conditions. That is, no row with the same entity_id and an id _greater_ than s1.id. If no such row exists, it must mean that s1 is the one with the greatest id for that entity. – Bill Karwin May 19 '16 at 00:20
10

Just to add my variation on it:

SELECT e.*, s1.score
FROM entities e
INNER JOIN score s1 ON e.id = s1.entity_id
WHERE NOT EXISTS (
    SELECT 1 FROM score s2 WHERE s2.id > s1.id
)
Ray Hidayat
  • 16,055
  • 4
  • 37
  • 43
  • 3
    I like this one! At least on SQL Server, this will run really fast. Now, I'd suggest changing the INNER for a LEFT JOIN, just in case a new entity has just been added and the process hasn't run yet. – Joe Pineda Jan 31 '09 at 04:29
  • 1
    For extra speed, you could put the existence test as part of the joining conditions. At least on SQL S. they're executed before the WHERE filtering is done, so you'd save some milliseconds per row by pruning your search there. – Joe Pineda Jan 31 '09 at 04:34
  • Hmmmm, WHERE filtering is not necessarily done after JOIN clauses. In fact they can be done first, especially if the WHERE clause filters on an INDEX... – MatBailie Jan 31 '09 at 12:07
5

approach 1

SELECT entities.*, 
       scores.score, 
       scores.date_added 
FROM entities

INNER  JOIN scores
ON entities.id = scores.entity_id

WHERE scores.date_added = 
  (SELECT max(date_added) FROM scores where entity_id = entities.id)
Michael Buen
  • 38,643
  • 9
  • 94
  • 118
4

I know this is a old question, just thought I'd add a approach no-one has mentioned yet, Cross Apply or Outer Apply. These are available in SQL Server 2005 (the database type is not tagged in this question) Or higher

Using the temporary tables

DECLARE @Entities TABLE(Id INT PRIMARY KEY, name NVARCHAR(MAX))
INSERT INTO @Entities
VALUES (1, 'Apple'), (2, 'Orange'), (3, 'Banana'), (4, 'Cherry')

DECLARE @Scores TABLE(Id INT PRIMARY KEY, Entity_Id INT, Score INT, Date_Added DATE)
INSERT INTO @Scores
VALUES (1,1,10,'2009-02-01'),
(2,2,10,'2009-02-01'),
(3,1,15,'2009-02-01'),
(4,2,10,'2009-03-01'),
(5,1,15,'2009-04-01'),
(6,2,15,'2009-04-01'),
(7,3,22,'2009-04-01')

You could use

SELECT E.Id, E.name, S.Score, S.Date_Added 
FROM @Entities E
CROSS APPLY
(
    SELECT TOP 1 * 
    FROM @Scores Sc 
    WHERE Sc.Entity_Id = E.Id  
    ORDER BY sc.Score DESC
) AS S

to get the desired results. The equivilent to allow entities without scores would be

SELECT E.Id, E.name, S.Score, S.Date_Added 
FROM @Entities E
OUTER APPLY
(
    SELECT TOP 1 * 
    FROM @Scores Sc 
    WHERE Sc.Entity_Id = E.Id  
    ORDER BY sc.Score DESC
) AS S
Manatherin
  • 4,169
  • 5
  • 36
  • 52
3

approach 2

query cost relative to batch:


SELECT entities.*, 
       scores.score, 
       scores.date_added 
FROM entities

INNER  JOIN scores
ON entities.id = scores.entity_id

inner join 
    (
    SELECT 
           entity_id, max(date_added) as recent_date
    FROM scores
    group by entity_id
    ) as y on entities.id = y.entity_id and scores.date_added = y.recent_date
Community
  • 1
  • 1
Michael Buen
  • 38,643
  • 9
  • 94
  • 118
1

You can also do this today in most RDBMSs (Oracle, PostgreSQL, SQL Server) with a natural query using window functions such as ROW_NUMBER:

SELECT id, name, score, date_added FROM (
 SELECT e.id, e.name, s.score, s.date_added,
 ROW_NUMBER() OVER (PARTITION BY e.id ORDER BY s.date_added DESC) rn
 FROM Entities e INNER JOIN Scores s ON e.id = s.entity_id
) tmp WHERE rn = 1;

SQL Fiddle

Cristian Scutaru
  • 1,375
  • 19
  • 23
1
SELECT entities.*, 
       scores.score, 
       scores.date_added 
FROM entities

INNER  JOIN scores
ON entities.id = scores.entity_id

WHERE entities.id in 
(select id from scores s2 where date_added = max(date_added) and s2.id = entities.id)

ORDER BY scores.date_added DESC
LIMIT 1
Otávio Décio
  • 73,752
  • 17
  • 161
  • 228
  • Your subquery is using a column (date_added) that doesn't exist in the table that you're querying. – Tom H Jan 31 '09 at 02:08