How to count matches across two sets of data

Question

I am using Microsoft SQL Server. I have two tables, t1 and t2, each with the following columns: PatientID, AdmissionDate, DiagnosisCode. Multiple diagnoses within an admission appear as multiple rows. Each table contains a different list of patients. The tables are large (400,000 rows), so the solution has to be efficient. I would like to calculate the similarity of each patient in table 1 to each patient in table 2. Similarity is defined as the number of diagnoses the two patients share divided by the following sum:

0.8 * (number of diagnoses of the patient in table 1 that are not matched to the patient in table 2) + 0.2 * (number of diagnoses of the patient in table 2 that are not matched to the patient in table 1) + (number of diagnoses the two patients share)
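
To make the definition concrete, here is a small worked example. The symbols S (shared diagnoses), A (diagnoses only in the table 1 patient), and B (diagnoses only in the table 2 patient) are shorthand introduced here, and the counts are made up for illustration: a table 1 patient with diagnoses {a, b, c, d} and a table 2 patient with {c, d, e} give S = 2, A = 2, B = 1.

```latex
\text{similarity} = \frac{S}{0.8\,A + 0.2\,B + S}
                  = \frac{2}{0.8 \cdot 2 + 0.2 \cdot 1 + 2}
                  = \frac{2}{3.8} \approx 0.526
```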

Any suggestions on how to approach this problem are appreciated.

Here is how I approached it using a cross join: first, form the cross join to find the diagnoses that match each other. Then, for each pair of cases, calculate the number of matches and mismatches. However, this solution seems to be very time consuming. - user2001212, Feb 27 '13 at 22:07

score 0 · Answer 1 · answered Feb 28 '13 at 16:49

Here is my attempt at solving this problem. I hope others can find more efficient ways:

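-- Pair every diagnosis row in #t1 with every row in #t2 and flag matching codes
select #t1.id1, #t1.adm1, #t1.dx1, #t2.id2, #t2.adm2, #t2.dx2, iif(#t1.dx1=#t2.dx2,1,0) as shared into #t3 from #t1 cross join #t2
-- Per diagnosis of admission 1: number of matches in admission 2, and a flag if there are none
select id1, adm1, dx1, id2, adm2, sum(shared) as In1In2, iif(sum(shared)=0,1,0) as In1Not2 into #t4 from #t3 group by id1, adm1, dx1, id2, adm2
-- Per admission pair: number of diagnoses in admission 1 that are not in admission 2 (dx1 dropped from the select list because it is not in the group by)
select id1, adm1, id2, adm2, sum(In1Not2) as nIn1Not2 into #t5 from #t4 group by id1, adm1, id2, adm2
-- Per diagnosis of admission 2: flag if it has no match in admission 1
select id1, adm1, dx2, id2, adm2, iif(sum(shared)=0,1,0) as In2Not1 into #t6 from #t3 group by id1, adm1, dx2, id2, adm2
-- Per admission pair: number of diagnoses in admission 2 that are not in admission 1
select id1, adm1, id2, adm2, sum(In2Not1) as nIn2Not1 into #t7 from #t6 group by id1, adm1, id2, adm2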

In the next step, the calculated values are combined into a common table. The problem with this attempt is that running it with 100,000 records in t1 and 400,000 records in t2 takes more than 2 hours.
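
Part of the cost is that the cross join materializes 100,000 × 400,000 = 40 billion rows before any grouping happens. One possible alternative, offered as an untested sketch rather than a verified solution, is to join only on equal diagnosis codes and derive the mismatch counts by subtraction from per-admission totals. It assumes the same #t1 (id1, adm1, dx1) and #t2 (id2, adm2, dx2) temp tables as above; the CTE names d1, d2, cnt1, cnt2, and common_dx are made up for illustration. Pairs with no shared diagnosis do not appear in the result; their similarity is 0 by definition.

```sql
-- Sketch only: assumes #t1(id1, adm1, dx1) and #t2(id2, adm2, dx2) exist.
;with d1 as (   -- distinct diagnoses per admission in #t1
    select distinct id1, adm1, dx1 from #t1
), d2 as (      -- distinct diagnoses per admission in #t2
    select distinct id2, adm2, dx2 from #t2
), cnt1 as (    -- total number of diagnoses per #t1 admission
    select id1, adm1, count(*) as n1 from d1 group by id1, adm1
), cnt2 as (    -- total number of diagnoses per #t2 admission
    select id2, adm2, count(*) as n2 from d2 group by id2, adm2
), common_dx as (   -- shared diagnoses, found with an equi-join instead of a cross join
    select d1.id1, d1.adm1, d2.id2, d2.adm2, count(*) as nShared
    from d1
    join d2 on d1.dx1 = d2.dx2
    group by d1.id1, d1.adm1, d2.id2, d2.adm2
)
select c.id1, c.adm1, c.id2, c.adm2,
       c.nShared,
       cnt1.n1 - c.nShared as nIn1Not2,   -- in admission 1 but not in admission 2
       cnt2.n2 - c.nShared as nIn2Not1,   -- in admission 2 but not in admission 1
       c.nShared / (0.8 * (cnt1.n1 - c.nShared)
                  + 0.2 * (cnt2.n2 - c.nShared)
                  + c.nShared) as similarity
from common_dx c
join cnt1 on cnt1.id1 = c.id1 and cnt1.adm1 = c.adm1
join cnt2 on cnt2.id2 = c.id2 and cnt2.adm2 = c.adm2;
```

The equi-join can still produce many rows when a diagnosis code is common to many admissions, so how much time this saves depends on the data.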
