2

I have a dataframe that is comparable to the one below:

V1 V2 V3 V4 V5 V6 V7
 A B  C  D  NA NA NA
 A E  F  NA NA NA NA
 D A  C  B  F  E  NA
 A E  NA NA NA NA NA

Each row is a patient and each letter in the dataframe represents a specific diagnosis.

I want to find how often specific diagnoses occur together, e.g. How many times does diagnosis A occur with diagnosis E row-wise? (Three times).

I am hoping to produce a matrix like this:
  A B C D E F
A 0 2 2 2 3 2
B 2 0 2
C 2 2 0 etc etc
D 2
E 3
F 2

(I have not completely filled it out)

It is essentially an adjacency matrix except that the observations don't need to be directly adjacent, they just need to be on the same row.

From here I would then produce a chorddiagram.

Thank you for any help!

medic12345
  • 23
  • 2

3 Answers3

2

Here is an alternative approach which uses a self-join to create the possible combinations of diagnoses for each patient:

library(data.table)
library(magrittr)
co_occ_mat <- function(DT) {
  DT[, id := .I] %>% 
    melt("id", na.rm = TRUE, value.name = "diagnosis") %>% 
    unique(by = c("id", "diagnosis")) %>% 
    .[., on = .(id), allow.cartesian = TRUE] %>% 
    .[diagnosis != i.diagnosis] %>% 
    dcast(diagnosis ~ i.diagnosis, length)
} 

With OP's sample data, co_occ_mat() returns

fread("V1 V2 V3 V4 V5 V6 V7
 A B  C  D  NA NA NA
 A E  F  NA NA NA NA
 D A  C  B  F  E  NA
 A E  NA NA NA NA NA") %>% 
  co_occ_mat()
   diagnosis A B C D E F
1:         A 0 2 2 2 3 2
2:         B 2 0 2 2 1 1
3:         C 2 2 0 2 1 1
4:         D 2 2 2 0 1 1
5:         E 3 1 1 1 0 2
6:         F 2 1 1 1 2 0

in line with OP's expected result.

The steps in co_occ_mat() are:

  1. add an id column for each row, i.e. patient
  2. reshape to long format
  3. remove any duplicates in case a diagnosis is reported more than once for a patient
  4. create pairs of diagnoses by a cartesian self-join for each id
  5. remove the trivial cases of pairs where both diagnoses are equal
  6. create the co-occurrence matrix by reshaping to wide format and counting the patients

Using the data from Roman's answer

RNGversion("3.6.0")
set.seed(357)
matrix(sample(LETTERS[1:15], size = 80, replace = TRUE), nrow = 8) %>% 
  as.data.table() %T>% print() %>% 
  co_occ_mat()
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1:  G  F  M  N  D  G  N  H  K   K
2:  H  I  C  K  H  E  H  E  I   G
3:  G  C  C  L  N  F  M  K  C   E
4:  A  K  G  O  I  C  C  B  O   I
5:  K  O  E  B  M  O  F  C  L   N
6:  D  H  K  H  I  N  B  F  A   H
7:  J  N  D  J  L  K  M  A  O   O
8:  J  D  I  M  O  H  N  O  H   H

we get

    diagnosis A B C D E F G H I J K L M N O
 1:         A 0 2 1 2 0 1 1 1 2 1 3 1 1 2 2
 2:         B 2 0 2 1 1 2 1 1 2 0 3 1 1 2 2
 3:         C 1 2 0 0 3 2 3 1 2 0 4 2 2 2 2
 4:         D 2 1 0 0 0 2 1 3 2 2 3 1 3 4 2
 5:         E 0 1 3 0 0 2 2 1 1 0 3 2 2 2 1
 6:         F 1 2 2 2 2 0 2 2 1 0 4 2 3 4 1
 7:         G 1 1 3 1 2 2 0 2 2 0 4 1 2 2 1
 8:         H 1 1 1 3 1 2 2 0 3 1 3 0 2 3 1
 9:         I 2 2 2 2 1 1 2 3 0 1 3 0 1 2 2
10:         J 1 0 0 2 0 0 0 1 1 0 1 1 2 2 2
11:         K 3 3 4 3 3 4 4 3 3 1 0 3 4 5 3
12:         L 1 1 2 1 2 2 1 0 0 1 3 0 3 3 2
13:         M 1 1 2 3 2 3 2 2 1 2 4 3 0 5 3
14:         N 2 2 2 4 2 4 2 3 2 2 5 3 5 0 3
15:         O 2 2 2 2 1 1 1 1 2 2 3 2 3 3 0

For some reason which I do not understand yet it is required to call RNGversion("3.6.0") before set.seed(357) in order to reproduce Roman's random numbers.

Note that this test case contains duplicate diagnoses per patient, e.g., K in row 1.

Uwe
  • 41,420
  • 11
  • 90
  • 134
1

I thought it would be fun to construct this by hand. The algorithm is pretty simple. For each patient find which diagnoses co-occur and write that to an upper triangle matrix.

set.seed(357)
xy <- matrix(sample(LETTERS[1:15], size = 80, replace = TRUE), nrow = 8)

> head(xy)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "G"  "F"  "M"  "N"  "D"  "G"  "N"  "H"  "K"  "K"  
[2,] "H"  "I"  "C"  "K"  "H"  "E"  "H"  "E"  "I"  "G"  
[3,] "G"  "C"  "C"  "L"  "N"  "F"  "M"  "K"  "C"  "E"  
[4,] "A"  "K"  "G"  "O"  "I"  "C"  "C"  "B"  "O"  "I"  
[5,] "K"  "O"  "E"  "B"  "M"  "O"  "F"  "C"  "L"  "N"  
[6,] "D"  "H"  "K"  "H"  "I"  "N"  "B"  "F"  "A"  "H" 

# Find all unique diagnoses.
all.diagnoses <- unique(as.vector(xy))
all.diagnoses <- sort(as.character(all.diagnoses))

# This is a way of creating an empty matrix.
out <- matrix(rep(NA, length(all.diagnoses)^2), nrow = length(all.diagnoses),
              dimnames = list(all.diagnoses, all.diagnoses))

for (i in 1:nrow(xy)) {
  combinations <- combn(unique(xy[i, ]), m = 2, simplify = FALSE)
  for (j in 1:length(combinations)) {
    # Add occurrence of each combination to the corresponding combination.
    com <- sort(combinations[[j]])
    out[com[1], com[2]]  <- sum(out[com[1], com[2]], 1, na.rm = TRUE)
  }
}

> out
   A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
A NA  2  1  2 NA  1  1  1  2  1  3  1  1  2  2
B NA NA  2  1  1  2  1  1  2 NA  3  1  1  2  2
C NA NA NA NA  3  2  3  1  2 NA  4  2  2  2  2
D NA NA NA NA NA  2  1  3  2  2  3  1  3  4  2
E NA NA NA NA NA  2  2  1  1 NA  3  2  2  2  1
F NA NA NA NA NA NA  2  2  1 NA  4  2  3  4  1
G NA NA NA NA NA NA NA  2  2 NA  4  1  2  2  1
H NA NA NA NA NA NA NA NA  3  1  3 NA  2  3  1
I NA NA NA NA NA NA NA NA NA  1  3 NA  1  2  2
J NA NA NA NA NA NA NA NA NA NA  1  1  2  2  2
K NA NA NA NA NA NA NA NA NA NA NA  3  4  5  3
L NA NA NA NA NA NA NA NA NA NA NA NA  3  3  2
M NA NA NA NA NA NA NA NA NA NA NA NA NA  5  3
N NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3
O NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
0

Here is another base R option using table:

pairs <- as.data.frame(do.call(rbind, 
    apply(dat, 1L, function(x) t(combn(na.omit(x), 2L)))))

tab <- table(pairs)
ut <- tab
ut[lower.tri(tab)] <- 0L
lt <- tab
lt[upper.tri(tab)] <- 0L
ans <- t(lt) + ut
ans + t(ans)

output:

   V1
V2  A B C D E F
  A 0 2 2 2 3 2
  B 2 0 2 2 1 1
  C 2 2 0 2 1 1
  D 2 2 2 0 1 1
  E 3 1 1 1 0 2
  F 2 1 1 1 2 0

data:

dat <- read.table(text="V1 V2 V3 V4 V5 V6 V7
A B  C  D  NA NA NA
A E  F  NA NA NA NA
D A  C  B  'F'  E  NA
A E  NA NA NA NA NA", header=TRUE, colClasses="character")
chinsoon12
  • 25,005
  • 4
  • 25
  • 35