0

I'm designing a simple messaging schema where a thread groups all messages that are sent between a collection of users. I'm getting stuck when I have to find an existing thread given a set of users.

There are 2 scenarios for sending a message:

Send To Thread: When viewing a thread, a message is sent directly to that thread, so the threadID is known. (not a problem)

Send To Recipients: A user creates a new message and specifies a set of recipients from scratch. I only want to create a new thread if one doesn't already exist between these users, which is where I'm stuck. I need a query that will find an existing threadID given a set of users. The ThreadMembers table maps users to threads. Is this even possible? Or do I need to modify my tables?

My tables:

Thread:
threadID (id)
lastSent (timestamp)

ThreadMembers:
threadFK (foreign key to thread)
userFK (foreign key to user)

Messages:
threadFK (foreign key to thread)
senderFK (foreign key to user)
msgID (id)
msgDate (timestamp)
msgText (text)

Thank you very much!

ruakh
  • 175,680
  • 26
  • 273
  • 307
Redtopia
  • 4,947
  • 7
  • 45
  • 68
  • Maybe something like: SELECT threadFK FROM ThreadMembers WHERE and userFK IN (user1,user2,user3,user4) – Redtopia Mar 28 '12 at 15:54
  • Which DBMS? (Oracle? PostgreSQL? MySQL? SQL Server?) – ruakh Mar 28 '12 at 15:55
  • By the way -- does this approach mean that every single message from user A to user B or vice versa will belong to a single thread, even if those messages span multiple years? That's a bit counterintuitive. – ruakh Mar 28 '12 at 15:56
  • SQL Server... and yes, the thread will span forever. A single thread will tie all messages together given any recipient group, whether it's between 2 people, or 200 people. – Redtopia Mar 28 '12 at 16:00

4 Answers4

1

I don't recommend this lightly, but I think you'd be better off denormalizing slightly by adding a column to Thread that contains a comma-separated sorted list of foreign keys to User. And indexing that column. Then your application just has to sort the user-IDs of the sender + all recipients, join the sorted list with commas, and look up the Thread record.

Since — by definition — the list of users in a thread never changes, you just need to populate these things correctly on insert, and you don't have to worry about later updates being consistent.

(To be clear: what you describe is definitely possible with a properly normalized schema. But it will be ugly, and I think it will perform poorly.)

ruakh
  • 175,680
  • 26
  • 273
  • 307
  • Interesting approach... I tend to avoid fields that contain comma separated values, but I see how this could be efficient. – Redtopia Mar 28 '12 at 16:12
  • After thinking about this, I'm not sure this would be easy, efficient, or less ugly. What if there are 100 recipients in your thread? The recipients field would be a long, comma separated list. You would have to ensure that each recipient would be in that list, which (I think) would be very ugly. – Redtopia Mar 28 '12 at 16:33
  • @Redtopia: I don't know what you mean by "You would have to ensure that each recipient would be in that list" -- you're saying that two messages belong to the same thread if their user-lists are *exactly* the same, so this is just a matter of plain old string equality (`WHERE comma_separated_sorted_list = '...'`). If you're concerned about length, an alternative is to store an MD5 or SHA-1 hash of the list rather than the list itself, or to store such a hash in addition to the first 100 characters of the list, or whatnot. – ruakh Mar 28 '12 at 17:07
  • OK, so the list would need to be sorted. That makes sense. – Redtopia Mar 28 '12 at 19:07
1

EDIT:

I realized, in the course of attempting to explain the query, that it wouldn't always work correctly. So, I went back and figured out how to test this. I'm still bugged by the schema setup - namely, it implies that new users can't be added to an existing thread, and that a specific set of users will only be able to talk in one thread - but it was good to correct the query.

WITH Selected_Users(id) as (VALUES (@id1), (@id2), --etc--),
     Threads(id) as (SELECT DISTINCT threadFk
                     FROM ThreadMembers as a
                     JOIN Selected_Users as b
                     ON b.id = a.userFk)
SELECT a.id
FROM Threads as a
WHERE NOT EXISTS (SELECT '1'
                  FROM ThreadMembers as b
                  LEFT JOIN Selected_Users as c
                  ON c.id = b.userFk
                  WHERE c.id IS NULL
                  AND b.threadFk = a.id)
AND NOT EXISTS (SELECT '1'
                FROM Selected_Users as b
                LEFT JOIN ThreadMembers as c
                ON c.userFk = b.id
                AND c.threadFk = a.id
                WHERE c.userFk IS NULL) 

The statement will likely have to be dynamic, to build the list of selected users, unless SQL Server has a way to provide a list as a host variable (I know DB2 does, at least from the iSeries). I don't have the perfect dataset to test this against, but against a multi-million row table (with only a many-one relationship), it returns almost instantly - I'm getting index-only access for this (hint hint).

Explanations:

WITH Selected_Users(id) as (VALUES (@id1), (@id2), --etc--),

This CTE is building the list of users so that it can be referenced as a table. This makes it easiest to deal with, although it would be possible to simply replaces it with an IN statement everywhere (requires multiple references, though).

     Threads(id) as (SELECT DISTINCT threadFk
                     FROM ThreadMembers as a
                     JOIN Selected_Users as b
                     ON b.id = a.userFk)

This CTE gets the list of (distinct) threads that the users are involved in. Mostly, this is just to chop the listing down to single references to threadFk.

SELECT a.id
FROM Threads as a

... Get the selected set of threads ...

WHERE NOT EXISTS (SELECT '1'
                  FROM ThreadMembers as b
                  LEFT JOIN Selected_Users as c
                  ON c.id = b.userFk
                  WHERE c.id IS NULL
                  AND b.threadFk = a.id)

Where there isn't anybody 'missing' from the selected list of users - that is, it eliminates threads with user-lists that are subsets of a larger one. It also eliminates threads that have some of the users listed from the selection, but also a few that aren't, meaning that the counts of the users would match, but the actual users would not (this is where my first version failed).


EDIT:

I realized that, while the existing statement takes care of the situation where the provided list of users is a subset of users listed for a given thread, I didn't take care of the situation where the list of selected users contains a subset that is the list of users for the given thread.

AND NOT EXISTS (SELECT '1'
                FROM Selected_Users as b
                LEFT JOIN ThreadMembers as c
                ON c.userFk = b.id
                AND c.threadFk = a.id
                WHERE c.userFk IS NULL) 

This clause fixes that. It makes sure that there aren't any leftover users in the selection list, after excluding users for a particular thread.

The statement is now bugging me a bit - there may be a slightly better way for me to do this...


EDIT:

Muwahaha, there is a COUNT(*) version, which should also be faster:

WITH Selected_Users(id) as (VALUES (@id1), (@id2), --etc--),
SELECT a.threadFk
FROM ThreadMembers as a
JOIN Selected_Users as b
ON b.id = a.userFk
GROUP BY a.threadFk
HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
AND COUNT(*) = (SELECT COUNT(*) from ThreadMembers as c
                WHERE c.threadFk = a.threadFk)

Explanations:

SELECT a.threadFk
FROM ThreadMembers as a
JOIN Selected_Users as b
ON b.id = a.userFk

This is joining to get all threads the listed members are a part of. This is the inside equivalent to the Threads CTE above. Actually, you could remove that CTE in the above query, too.

GROUP BY a.threadFk

We only want one instance of a given thread after all. Also (in DB2 at least), the rest of the statement isn't valid unless it's present.

HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)

Verify that, for the given thread, all of the selected users are present. Or, all of the selected users must be present in the given thread.

AND COUNT(*) = (SELECT COUNT(*) from ThreadMembers as c
                WHERE c.threadFk = a.threadFk)

Verify that, for the given thread, there are no non-selected users. Or, there must not be any users 'left out'

You should get index-only access for this (I seem to be). The COUNT(*) of the result rows (for the GROUP BY) should only be performed once, and reused. The HAVING clause is evaluated after the GROUP BY takes place (if I recall correctly), so the sub-select for the count from the original table should only take place once per threadFk.

Clockwork-Muse
  • 12,806
  • 6
  • 31
  • 45
  • Thanks X-Zero! I'm trying to wrap my head around the query... though I've written plenty of SQL, I'm by no means an expert and these kinds of queries are confusing to me. Can you add a comment that briefly explains what the query is doing? And, do you foresee any performance issues if there are millions of threads? – Redtopia Mar 28 '12 at 16:27
  • A thread is defined as a conversation between a set of users. I hadn't thought about new users being added to a thread, but why not? Seems like a good feature. I'm going to try this out, but I'm not sure where it will take me. However, I won't be able to get to this for a few days... in the meantime, I'm giving you point for you help! – Redtopia Mar 28 '12 at 19:13
  • I'm trying to write this query in SQL Server 2008 and I'm getting some errors. The first part is building a table from the IDs of the users, is that correct? I believe that syntax is different in TSQL. Do you have any idea how to write this in TSQL? – Redtopia Mar 28 '12 at 20:32
  • Yes, the first part is just getting a table of the users. Sorry, I don't really know. I thought that your host-variables were designated by `'@'`, but I could be wrong. If you're constructing it dynamically, you should just be able to list the id's. That exact statement (well, with host variables replaced) worked on my (DB2) system, and I _thought_ it would be universal. – Clockwork-Muse Mar 28 '12 at 21:08
  • OK, I think I got the temp table going, but I'm having some problems with the column names. The second line: Threads(id)... id is not a field in Threads... shouldn't it be Threads(threadID)? – Redtopia Mar 28 '12 at 22:24
  • Got it working... seems to work. I posted a TSQL version below. Thank you! – Redtopia Mar 28 '12 at 22:47
  • Sorry, I missed that I'd named one of my CTEs the same as a table - that's going to cause collisions. I recommend you use (or at least look at) the `COUNT(*)` version I posted at the bottom - it's liable to be faster, and doesn't have the collision problem. – Clockwork-Muse Mar 28 '12 at 22:57
0

Is it correct to say you are interested whether any thread exists that: 1) has the same count in threadmembers, when grouped by threadFK, as the number of members of the group you are interested in, 2) has and link to each member? If so, I think a solution will follow from there (so this is a proposed answer). Exact mechanics would vary with what brand of database you are using, oracle, postgres or sql server probably would be simpler than other brands. How do you want to call the thing, as a stored procedure that takes a table of users, a list of user names, and returns, what, the key if there's a match, or NULL?

Levin Magruder
  • 1,905
  • 18
  • 26
  • There will be multiple threads with the same user count, so I'm not sure that's going to help. Seems like it should be simple, though as with any messaging system, it also has to be efficient. Given a set of users, I need to see if a thread exists between them. If not, I will create one. There should only be a single thread between any group of users. Assume there will be millions of threads. I haven't implemented anything yet, so an answer can suggest modifications to my tables. – Redtopia Mar 28 '12 at 16:09
0

Here's an example of the answer (answer number 1) using MS SQL Server 2008. This assumes that table: MessageThreadUsers (threadFK - int, userFK - varchar) is defined (your key types may be different):

DELETE FROM MessageThreadUsers
GO

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (1, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (1, 'user2')

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (2, 'user3')

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (3, 'user4')

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user4')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (4, 'user5')

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user1')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user2')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user4')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user5')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (5, 'user6')

INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user6')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user3')
INSERT INTO MessageThreadUsers (threadFK, userFK) VALUES (6, 'user1')

GO

WITH Selected_Users (id) AS (
    SELECT 'user3' UNION
    SELECT 'user1' UNION
    SELECT 'user6'
)
SELECT a.threadFk
FROM MessageThreadUsers as a
JOIN Selected_Users as b
ON b.id = a.userFk
GROUP BY a.threadFk
HAVING COUNT(*) = (SELECT COUNT(*) FROM Selected_Users)
AND COUNT(*) = (SELECT COUNT(*) from MessageThreadUsers as c
                WHERE c.threadFk = a.threadFk)
Redtopia
  • 4,947
  • 7
  • 45
  • 68