How can you compare sets of numbers and get the most relevant results using MySQL and PHP?

Question

Consider this:

set A: 1 2 3 4
set B:     3 4 5 6
set C:       4 5 6 7
set D: 1

I want to compare D with the rest and get as a result a set of numbers as most relevant. The result should be in this order: 4 (as D has a common number with A and 4 is in A and also in B and C), 3 (as D has a common number with A and 3 is in A and B), 2 (as D has a common number with A and 2 is also in A), then 5, 6, 7.

Is there some algorithm to do this in an efficient way in PHP/MySQL? I don't want to reinvent the wheel, and also the database would eventually have a huge number of sets..

Can you word the question better. why is set D in you code set D:1. Is this the result set? — Yada, Dec 09 '09 at 12:55
No, it's just a set I want to compare with all the others so I can get the most relevant numbers. Any set of numbers could be the query to compare with others. i.e. if set D was 1, 7: the relevant should be by order: 4, 5, 6 (because 1, 7 are common to 2 sets which have in common 4, and because 7 is in a set that has more 'connections' as in 4 5 6 with others, 5 and 6 should be the most relevant), 3, 2 (as they have less 'connections' but are still relevant) -> It's complex but I don't know how to phrase it better.. — stagas, Dec 09 '09 at 13:15

score 2 · Answer 1 · answered Dec 09 '09 at 13:19

One example doesn't make a complete specification. For example, how would your answer be different if the collection of sets also included

set E: 1 2 3
set F: 1   3

which would make 3 the most frequently-occurring value among sets that have non-empty intersection with D? So here are my assumptions:

Given a target set (D in your original example):

Values in "overlapping sets" (sets that have non-empty intersection with the target set) are more relevant that values not in those overlapping sets.
Under the constraint of statement 1, relevance is determined by frequency of occurrence.

In your original example, A overlaps with D, so the universe {1, 2, 3, 4, 5, 6, 7} is partitioned into overlapping {1, 2, 3, 4} and non-overlapping {5, 6, 7}. The value frequencies are {1:2, 2:1, 3:2, 4:3, 5:2, 6:2, 7:1}. Combining these facts gives overlapping frequencies {1:2, 2:1, 3:2, 4:3} and non-overlapping frequencies {5:2, 6:2, 7:1}, which produces the order 4, 3, 1, 2 followed by 5, 6, 7. (I notice that you didn't assign a relevance to 1. If deliberate, that can be a final step of removing values of the target set from the final ordering.)

In my adjusted example, the frequencies become {1:4, 2:3, 3:4, 4:3, 5:2, 6:2, 7:1}. That gives overlapping frequencies {1:4, 2:3, 3:4, 4:3} and non-overlapping frequencies {5:2, 6:2, 7:1}, which produces the order 1, 3, 2, 4 followed by 5, 6, 7.

Pseudo-code for this algorithm is:

Initialize overlapping and universe to be empty sets and frequency to be an empty hash.

For each set s in the collection of sets (other than t, the target set):

2.1. Set universe to the union of s and universe

2.2. If s intersected with t has at least one element:

2.2.1. Set `overlapping` to the union of `overlapping` and `s`

2.3. For each element e in s:

2.3.1. If 'e' is a key in `frequency`

    2.3.1.1. Then increase the value (count) for `e` in `frequency` by 1
    2.3.1.2. Else initialize the value (count) for `e` in `frequency` to 1

Set nonOverlapping to the difference of universe and overlapping
Sort the elements of universe by their values in frequency as the first part of the result.
Append to the result the elements of nonOverlapping, also sorted by their values in frequency.

(If you did intend for elements of t to be eliminated, I'd do that as a post-processing step in 4.)

This is very helpful thank you! I was thinking some algorithm in php/mysql code should be already somewhere but I couldn't find anything. I can code this but I don't think my code would be anywhere near fast.. — stagas, Dec 09 '09 at 13:39

score 1 · Accepted Answer · edited May 23 '17 at 12:26

1

In SQL, I'll assume you have a table called sets, with 2 columns, e for the elements and s for the set name.

select e,count(*) as c from sets where s in
(select s from sets where e in (select e from sets where s='D') group by s)
group by e order by c desc

explanation:

(select e from sets where s='D')

selects the elements of group D.

(select s from sets where e in (select e from sets where s='D') group by s)

selects all the groups that have common members with the previously selected group.

and then you select all elements from these sets, and order them by the number of appearances (as joel suggested)

edited May 23 '17 at 12:26

Community

1
1

answered Dec 09 '09 at 13:49

Ofri Raviv

24,375
3
55
55

Excellent. Thank you Ofri. This is exactly what I wanted! One question though: how fast would this query be with all the sets adding elements frequently and sets are added all the time? I guess this is as fast as it can get? – stagas Dec 09 '09 at 14:11
I'm not sure its as fast as possible, as I'm no SQL expert. but its simple enough, so you can just go ahead and try it (with synthetic data). – Ofri Raviv Dec 09 '09 at 14:49

How can you compare sets of numbers and get the most relevant results using MySQL and PHP?

2 Answers2

Linked