I am using Micorosft SQL Server. I have two tables, t1 and t2, that each consist of the following set of variables: PatientID, AdmissionDate, DiagnosisCode. Note that multiple diagnoses within an admission are shown as multiple rows. Each table shows a different list of patients. These tables are large so the solution has to be efficient (400,000 rows). I would like to calculate the similarity of patients in table 1 to the patients in table 2. Similarity is defined as ratio of number of diagnoses the two patients share divided by the following sum:
.8*(number of diagnosis of the patient in table 1 that is not matched to patient in table 2) + .2*(number of diagnoses of patient in table 2 that is not matched to the patient in table 1) + (number of diagnoses the two patients share)
Any suggestions of how to organize this problem is appreciated.