Here is an alternative approach which uses a self-join to create the possible combinations of diagnoses for each patient:
library(data.table)
library(magrittr)
co_occ_mat <- function(DT) {
  DT[, id := .I] %>%
    melt("id", na.rm = TRUE, value.name = "diagnosis") %>%
    unique(by = c("id", "diagnosis")) %>%
    .[., on = .(id), allow.cartesian = TRUE] %>%
    .[diagnosis != i.diagnosis] %>%
    dcast(diagnosis ~ i.diagnosis, length)
}
With OP's sample data, co_occ_mat() returns
fread("V1 V2 V3 V4 V5 V6 V7
A B C D NA NA NA
A E F NA NA NA NA
D A C B F E NA
A E NA NA NA NA NA") %>%
co_occ_mat()
   diagnosis A B C D E F
1:         A 0 2 2 2 3 2
2:         B 2 0 2 2 1 1
3:         C 2 2 0 2 1 1
4:         D 2 2 2 0 1 1
5:         E 3 1 1 1 0 2
6:         F 2 1 1 1 2 0
in line with OP's expected result.
The steps in co_occ_mat() are:
- add an id column for each row, i.e., for each patient
- reshape to long format
- remove any duplicates in case a diagnosis is reported more than once for a patient
- create pairs of diagnoses by a cartesian self-join for each id
- remove the trivial cases of pairs where both diagnoses are equal
- create the co-occurrence matrix by reshaping to wide format and counting the patients
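The steps above can also be run one at a time to inspect the intermediate results. The following is only a sketch on a made-up two-patient table (not OP's data); the names long, pairs, and res are chosen for illustration, and copy() plus value.var = "id" are additions of this sketch so that the input is left untouched and dcast() does not have to guess the value column.

```r
library(data.table)

# Hypothetical input: patient 1 has A, B, C; patient 2 has A reported twice
DT <- data.table(V1 = c("A", "A"), V2 = c("B", "A"), V3 = c("C", NA))

# Steps 1 and 2: add an id per row (on a copy) and reshape to long format
long <- melt(copy(DT)[, id := .I], id.vars = "id", na.rm = TRUE,
             value.name = "diagnosis")

# Step 3: keep each diagnosis only once per patient
long <- unique(long, by = c("id", "diagnosis"))

# Step 4: cartesian self-join on id gives all ordered pairs per patient
pairs <- long[long, on = .(id), allow.cartesian = TRUE]

# Step 5: drop the pairs where both diagnoses are equal
pairs <- pairs[diagnosis != i.diagnosis]

# Step 6: reshape to wide format, counting the rows (patients) per pair
res <- dcast(pairs, diagnosis ~ i.diagnosis, fun.aggregate = length,
             value.var = "id")
print(res)
```

Patient 2 contributes no pairs here because only one distinct diagnosis remains after the duplicate is removed.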
Using the data from Roman's answer
RNGversion("3.6.0")
set.seed(357)
matrix(sample(LETTERS[1:15], size = 80, replace = TRUE), nrow = 8) %>%
as.data.table() %T>% print() %>%
co_occ_mat()
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1:  G  F  M  N  D  G  N  H  K   K
2:  H  I  C  K  H  E  H  E  I   G
3:  G  C  C  L  N  F  M  K  C   E
4:  A  K  G  O  I  C  C  B  O   I
5:  K  O  E  B  M  O  F  C  L   N
6:  D  H  K  H  I  N  B  F  A   H
7:  J  N  D  J  L  K  M  A  O   O
8:  J  D  I  M  O  H  N  O  H   H
we get
    diagnosis A B C D E F G H I J K L M N O
 1:         A 0 2 1 2 0 1 1 1 2 1 3 1 1 2 2
 2:         B 2 0 2 1 1 2 1 1 2 0 3 1 1 2 2
 3:         C 1 2 0 0 3 2 3 1 2 0 4 2 2 2 2
 4:         D 2 1 0 0 0 2 1 3 2 2 3 1 3 4 2
 5:         E 0 1 3 0 0 2 2 1 1 0 3 2 2 2 1
 6:         F 1 2 2 2 2 0 2 2 1 0 4 2 3 4 1
 7:         G 1 1 3 1 2 2 0 2 2 0 4 1 2 2 1
 8:         H 1 1 1 3 1 2 2 0 3 1 3 0 2 3 1
 9:         I 2 2 2 2 1 1 2 3 0 1 3 0 1 2 2
10:         J 1 0 0 2 0 0 0 1 1 0 1 1 2 2 2
11:         K 3 3 4 3 3 4 4 3 3 1 0 3 4 5 3
12:         L 1 1 2 1 2 2 1 0 0 1 3 0 3 3 2
13:         M 1 1 2 3 2 3 2 2 1 2 4 3 0 5 3
14:         N 2 2 2 4 2 4 2 3 2 2 5 3 5 0 3
15:         O 2 2 2 2 1 1 1 1 2 2 3 2 3 3 0
For a reason I do not yet understand, RNGversion("3.6.0") has to be called before set.seed(357) in order to reproduce Roman's random numbers.
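One thing that can be checked, offered as a possibility and not as a confirmed cause: R 3.6.0 changed the default method used by sample() from "Rounding" to "Rejection", and RNGkind() reports which one the current session uses. The sketch below only inspects the settings and confirms that the draw is repeatable once they are fixed; the variable names kinds, x1, and x2 are illustrative.

```r
# Current settings; in R >= 3.6.0 the third element is the sample.kind
print(RNGkind())

# Select the R 3.6.0 defaults, which include sample.kind = "Rejection"
RNGversion("3.6.0")
kinds <- RNGkind()

# With the settings fixed, the same seed yields the same sample
set.seed(357)
x1 <- sample(LETTERS[1:15], size = 80, replace = TRUE)
set.seed(357)
x2 <- sample(LETTERS[1:15], size = 80, replace = TRUE)
identical(x1, x2)
```

If the first print() already shows "Rejection" as the third element, the sampling method is probably not what makes the RNGversion() call necessary.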
Note that this test case contains duplicate diagnoses per patient, e.g., K in row 1.
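To see why the unique() step matters for such rows, here is a small sketch with one made-up patient who has K reported twice. count_pairs() is a hypothetical helper written for this illustration only; it performs the self-join and counts the rows per ordered pair.

```r
library(data.table)

# Hypothetical long-format data: one patient, K reported twice
long <- data.table(id = 1L, diagnosis = c("G", "K", "K"))

# Illustrative helper: self-join on id, drop equal pairs, count per pair
count_pairs <- function(x) {
  x[x, on = .(id), allow.cartesian = TRUE][
    diagnosis != i.diagnosis, .N, by = .(diagnosis, i.diagnosis)]
}

with_dups    <- count_pairs(long)
without_dups <- count_pairs(unique(long, by = c("id", "diagnosis")))

print(with_dups)     # each G/K pair is counted twice for the same patient
print(without_dups)  # each G/K pair is counted once
```

Without the de-duplication, a single patient would contribute more than once to the same cell of the co-occurrence matrix.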