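I have done something similar. First, I create every combination of Verbatim and LowestlevelTerm by cross joining the data with itself and dropping the pairs where a Verbatim is matched against itself.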
library(dplyr)  # provides %>%, filter(), select(), distinct()

dat <- tibble::tribble(
  ~Verbatim,                    ~LowestlevelTerm,
  "Acute Bronchitis",           "Acute Bronchitis",
  "Sinusitis Maxillaris Acuta", "Acute Maxillary Sinusitis",
  "Increase In Eosinophils",    "Eosinophil Count Increased",
  "Bronchitis Acuta",           "Bronchitis Acute",
  "Acute Sinusitis Maxillaris", "Acute Sinusitis, Maxillary",
  "Eosinophil Increase",        "Eosinophil Count Increased",
  "Increase In Eosinophilia",   "Eosinophilia"
)

# Cross join dat with itself (by = NULL gives the Cartesian product),
# drop self-pairs, and keep each Verbatim / LowestlevelTerm pairing once.
dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct()
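To make the cross join step concrete, here is a minimal sketch on a toy data frame. The names `toy`, `all_pairs`, `other_pairs`, `id` and `term` are made up for illustration and are not part of the data above.

```r
# Toy data frame; all names here are illustrative only.
toy <- data.frame(id = c("a", "b"), term = c("x", "y"),
                  stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product: 2 x 2 = 4 rows.
# Columns that clash get the default .x / .y suffixes.
all_pairs <- merge(toy, toy, by = NULL)
names(all_pairs)  # "id.x" "term.x" "id.y" "term.y"
nrow(all_pairs)   # 4

# Dropping self-pairs mirrors filter(Verbatim.x != Verbatim.y).
other_pairs <- all_pairs[all_pairs$id.x != all_pairs$id.y, ]
nrow(other_pairs) # 2
```

The row count grows with the square of the input, so this is fine for a few thousand terms but may need blocking or batching for much larger term lists.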
Then I calculate a bunch of different metrics from stringdist. For the purposes of this answer I'll show them all, but use the Levenshtein edit distance (lev) as your "clustering" metric. In other words, for each Verbatim we look for the LowestlevelTerm with the smallest lev among all of its candidate pairings.
library(stringdist)

dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct() %>%
  mutate(
    lev     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lv"),             # Levenshtein: insertions, deletions, substitutions
    osa     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "osa"),            # lv plus transposition of adjacent characters; each substring edited at most once
    dl      = stringdist(Verbatim.x, LowestlevelTerm.y, method = "dl"),             # full Damerau-Levenshtein: like osa, but a substring may be edited more than once
    lcs     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lcs"),            # edit distance using insertions and deletions only
    qgram   = stringdist(Verbatim.x, LowestlevelTerm.y, method = "qgram", q = 2),   # counts q-grams that are not shared
    cosine  = stringdist(Verbatim.x, LowestlevelTerm.y, method = "cosine"),         # cosine distance between q-gram count vectors (default q = 1)
    jaccard = stringdist(Verbatim.x, LowestlevelTerm.y, method = "jaccard", q = 2)  # compares q-gram sets: 0 is all matching, 1 is none matching
  ) %>%
  arrange(Verbatim.x, lev)  # smallest lev first within each Verbatim
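The edit-distance methods are easier to tell apart on tiny strings. The sketch below uses made-up inputs (not taken from the data) chosen so that each method gives a different answer; the variable names are illustrative only.

```r
library(stringdist)

# Swapping two adjacent characters: "ab" -> "ba"
lv_swap  <- stringdist("ab", "ba", method = "lv")   # 2: two substitutions
osa_swap <- stringdist("ab", "ba", method = "osa")  # 1: one transposition

# Where osa and dl disagree: "ca" -> "abc"
# dl may transpose "ca" to "ac" and then insert "b" in between (2 edits);
# osa cannot edit the transposed substring again, so it needs 3 edits.
osa_ca <- stringdist("ca", "abc", method = "osa")   # 3
dl_ca  <- stringdist("ca", "abc", method = "dl")    # 2

# lcs has no substitution, so replacing one character costs
# a deletion plus an insertion.
lv_sub  <- stringdist("abc", "abd", method = "lv")  # 1
lcs_sub <- stringdist("abc", "abd", method = "lcs") # 2
```

On the term data shown here lev, osa and dl come out identical for every pair, so the choice between those three matters little; lcs and the q-gram measures are the ones that rank the pairs differently.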
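It's more art than science from this point. Using a cutoff of lev < 15 would seem to work out pretty well to "cluster" things that are similar, though it is not perfect: in the output below, "Acute Bronchitis" against "Eosinophilia" scores 12 and slips under the cutoff, while its jaccard of 0.96 shows the two share almost nothing. It is worth checking the q-gram columns alongside lev.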
head(dat3, 20)
                   Verbatim.x          LowestlevelTerm.y lev osa dl lcs qgram     cosine   jaccard
 1           Acute Bronchitis           Bronchitis Acute  12  12 12  12     4 0.00000000 0.2352941
 2           Acute Bronchitis               Eosinophilia  12  12 12  18    24 0.47559558 0.9600000
 3           Acute Bronchitis  Acute Maxillary Sinusitis  13  13 13  17    23 0.26902612 0.7419355
 4           Acute Bronchitis Acute Sinusitis, Maxillary  16  16 16  20    24 0.27637277 0.7500000
 5           Acute Bronchitis Eosinophil Count Increased  23  23 23  30    36 0.27700119 0.9473684
 6 Acute Sinusitis Maxillaris           Acute Bronchitis  16  16 16  20    24 0.26893401 0.7419355
 7 Acute Sinusitis Maxillaris  Acute Maxillary Sinusitis  17  17 17  19     5 0.02028473 0.1538462
 8 Acute Sinusitis Maxillaris           Bronchitis Acute  20  20 20  30    24 0.26893401 0.7419355
 9 Acute Sinusitis Maxillaris               Eosinophilia  21  21 21  28    30 0.34684389 0.9062500
10 Acute Sinusitis Maxillaris Eosinophil Count Increased  24  24 24  38    44 0.34461985 0.9347826
11           Bronchitis Acuta           Acute Bronchitis  12  12 12  12     6 0.04545455 0.3333333
12           Bronchitis Acuta               Eosinophilia  13  13 13  16    24 0.42792245 0.9600000
13           Bronchitis Acuta Acute Sinusitis, Maxillary  19  19 19  28    28 0.24622164 0.8235294
14           Bronchitis Acuta Eosinophil Count Increased  20  20 20  26    36 0.30843593 0.9473684
15           Bronchitis Acuta  Acute Maxillary Sinusitis  21  21 21  29    27 0.23856887 0.8181818
16        Eosinophil Increase Eosinophil Count Increased   7   7  7   7     7 0.06910435 0.2800000
17        Eosinophil Increase               Eosinophilia   8   8  8   9    11 0.21106794 0.5500000
18        Eosinophil Increase           Bronchitis Acute  15  15 15  21    29 0.32696355 0.9354839
19        Eosinophil Increase           Acute Bronchitis  17  17 17  25    29 0.32696355 0.9354839
20        Eosinophil Increase  Acute Maxillary Sinusitis  21  21 21  34    36 0.36333027 0.9230769
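One way to act on these scores is sketched below: apply the cutoff to get candidate pairs, and separately keep the lowest-lev match per Verbatim. The rows are copied from the output above (only three columns); `scored`, `cutoff`, `candidates` and `best` are names I made up for this sketch, and `slice_min()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# A handful of rows copied from the printed output, lev column only.
scored <- tibble::tribble(
  ~Verbatim.x,           ~LowestlevelTerm.y,           ~lev,
  "Acute Bronchitis",    "Bronchitis Acute",             12,
  "Acute Bronchitis",    "Eosinophilia",                 12,
  "Acute Bronchitis",    "Eosinophil Count Increased",   23,
  "Eosinophil Increase", "Eosinophil Count Increased",    7,
  "Eosinophil Increase", "Eosinophilia",                  8,
  "Eosinophil Increase", "Acute Maxillary Sinusitis",    21
)

cutoff <- 15  # the threshold suggested above; tune it for your own data

# Every pairing that falls under the cutoff (4 of the 6 rows here).
candidates <- scored %>%
  filter(lev < cutoff)

# The minimum-lev match for each Verbatim; ties are kept on purpose.
best <- scored %>%
  group_by(Verbatim.x) %>%
  slice_min(lev, n = 1, with_ties = TRUE) %>%
  ungroup()
```

Here `best` has three rows, because "Acute Bronchitis" ties at lev = 12 between "Bronchitis Acute" and "Eosinophilia". Breaking such ties with a second metric such as jaccard is one option, but that is a judgment call rather than something the scores decide for you.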