How to cluster the similar texts in R

Question

I know similar question might have asked in this/different forum but I feel my requirement is different. I have 2 columns dataframe as shown in below:

Verbatim LowestlevelTerm

Acute Bronchitis Acute Bronchitis

Sinusitis Maxillaris Acuta Acute Maxillary Sinusitis

Increase In Eosinophils Eosinophil Count Increased

Bronchitis Acuta Bronchitis Acute

Acute Sinusitis Maxillaris Acute Sinusitis, Maxillary

Eosinophil Increase Eosinophil Count Increased

Increase In Eosinophilia Eosinophilia

I am trying to get the below output with my code but I am not finding any luck

Verbatim LowestlevelTerm Cluster id

Acute Bronchitis Acute Bronchitis 1

Bronchitis Acuta Bronchitis Acute 1

Sinusitis Maxillaris Acuta Acute Maxillary Sinusitis 2

Acute Sinusitis Maxillaris Acute Sinusitis, Maxillary 2

Increase In Eosinophils Eosinophil Count Increased 3

Eosinophil Increase Eosinophil Count Increased 3

Increase In Eosinophilia Eosinophilia 3

Code which I am using to fulfil my requirement

new_df <- df %>%
  group_by(LowestlevelTerm) %>%
  summarise(Clusterid = toString(ID))

Could you please let me know if there any simple way to cluster this terms using any other functions?

i see you tagged this with `stringdist`. Are you using that package to measure the edit or qgram distance? Is that how you want to cluster? — Chris Umphlett, Nov 25 '19 at 19:40
I am using the stringdist package to calculate the similarity between Verbatim and Lowestlevelterm . It is fine to use any approach to get the requirement done . I tried with dplyr package but no success. Do you have any suggestions/code to do this task? — Pavan kumar, Nov 26 '19 at 04:33

Chris Umphlett · Answer 1 · 2019-11-27T13:47:36.713

I have done something similar and I first create all of the combinations of both terms.

dat<-tibble::tribble(
                     ~Verbatim,             ~LowestlevelTerm,
            "Acute Bronchitis",           "Acute Bronchitis",
  "Sinusitis Maxillaris Acuta",  "Acute Maxillary Sinusitis",
     "Increase In Eosinophils", "Eosinophil Count Increased",
            "Bronchitis Acuta",           "Bronchitis Acute",
  "Acute Sinusitis Maxillaris", "Acute Sinusitis, Maxillary",
         "Eosinophil Increase", "Eosinophil Count Increased",
    "Increase In Eosinophilia",               "Eosinophilia"
  )

dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct()

Then I calculate a bunch of different metrics from stringdist. For purposes of this answer I'll show them all, but use the levenshtein edit distance as your "clustering" metric. In other words, will find the minimized lev for each unique combination for each Verbatim.

library(stringdist)
     dat3 <- merge(dat, dat, by = NULL) %>%
      filter(Verbatim.x != Verbatim.y) %>%
      select(Verbatim.x, LowestlevelTerm.y) %>%
      distinct() %>%
  mutate(
    lev = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lv") #like lcs, but permits substitutions
    ,osa = stringdist(Verbatim.x, LowestlevelTerm.y, method = "osa") #lv + transpositions of adjacent characters
    ,dl = stringdist(Verbatim.x, LowestlevelTerm.y, method = "dl") # i think, is similar to osa but can transpose non-adjacent characters
    ,lcs = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lcs") # edit distance using insertions and deletions
    ,qgram = stringdist(Verbatim.x, LowestlevelTerm.y, method = "qgram", q = 2) #counts q-grams that are not shared
    ,cosine = stringdist(Verbatim.x, LowestlevelTerm.y, method = "cosine") # more complicated math than the other q-gram methods
    ,jaccard = stringdist(Verbatim.x, LowestlevelTerm.y, method = "jaccard", q = 2) #compare q-grams, 0 is all matching, 1 is none matching
  ) %>%
  arrange(Verbatim.x, lev)

It's more art than science from this point. Using a cutoff of lev < 15 would seem to work out pretty well to "cluster" things that are similar.

head(dat3, 20)
                   Verbatim.x          LowestlevelTerm.y lev osa dl lcs qgram     cosine   jaccard
1            Acute Bronchitis           Bronchitis Acute  12  12 12  12     4 0.00000000 0.2352941
2            Acute Bronchitis               Eosinophilia  12  12 12  18    24 0.47559558 0.9600000
3            Acute Bronchitis  Acute Maxillary Sinusitis  13  13 13  17    23 0.26902612 0.7419355
4            Acute Bronchitis Acute Sinusitis, Maxillary  16  16 16  20    24 0.27637277 0.7500000
5            Acute Bronchitis Eosinophil Count Increased  23  23 23  30    36 0.27700119 0.9473684
6  Acute Sinusitis Maxillaris           Acute Bronchitis  16  16 16  20    24 0.26893401 0.7419355
7  Acute Sinusitis Maxillaris  Acute Maxillary Sinusitis  17  17 17  19     5 0.02028473 0.1538462
8  Acute Sinusitis Maxillaris           Bronchitis Acute  20  20 20  30    24 0.26893401 0.7419355
9  Acute Sinusitis Maxillaris               Eosinophilia  21  21 21  28    30 0.34684389 0.9062500
10 Acute Sinusitis Maxillaris Eosinophil Count Increased  24  24 24  38    44 0.34461985 0.9347826
11           Bronchitis Acuta           Acute Bronchitis  12  12 12  12     6 0.04545455 0.3333333
12           Bronchitis Acuta               Eosinophilia  13  13 13  16    24 0.42792245 0.9600000
13           Bronchitis Acuta Acute Sinusitis, Maxillary  19  19 19  28    28 0.24622164 0.8235294
14           Bronchitis Acuta Eosinophil Count Increased  20  20 20  26    36 0.30843593 0.9473684
15           Bronchitis Acuta  Acute Maxillary Sinusitis  21  21 21  29    27 0.23856887 0.8181818
16        Eosinophil Increase Eosinophil Count Increased   7   7  7   7     7 0.06910435 0.2800000
17        Eosinophil Increase               Eosinophilia   8   8  8   9    11 0.21106794 0.5500000
18        Eosinophil Increase           Bronchitis Acute  15  15 15  21    29 0.32696355 0.9354839
19        Eosinophil Increase           Acute Bronchitis  17  17 17  25    29 0.32696355 0.9354839
20        Eosinophil Increase  Acute Maxillary Sinusitis  21  21 21  34    36 0.36333027 0.9230769

Thanks for the update. I do not want to calculate the similairty between the terms. I want to cluster the terms. Could you please check once my expected output Verbatim LowestlevelTerm Cluster id Acute Bronchitis Acute Bronchitis 1 Bronchitis Acuta Bronchitis Acute 1 Sinusitis Maxillaris Acuta Acute Maxillary Sinusitis 2 Acute Sinusitis Maxillaris Acute Sinusitis, Maxillary 2 Increase In Eosinophils Eosinophil Count Increased 3 Eosinophil Increase Eosinophil Count Increased 3 — Pavan kumar, Nov 27 '19 at 04:42
I updated the code to reflect that you want clusters of more than one term. Still the same approach. You said previously that you did want to calculate similarity (you didn't say that exactly, but you said you want to use stringdist to cluster. Stringdist calculates similarity). I think you can try different thresholds and pick one that groups things together in a way that you like — Chris Umphlett, Nov 27 '19 at 13:48
Yes, I used Stringdist already for similarity and now I want only to cluster the terms in the below format. I mean to say I need to find how many clusters are there with similar text. Sample Output: Verbatim LowestlevelTerm Cluster id Acute Bronchitis Acute Bronchitis 1 Bronchitis Acuta Bronchitis Acute 1 Sinusitis Maxillaris Acuta Acute Maxillary Sinusitis 2 Acute Sinusitis Maxillaris Acute Sinusitis, Maxillary 2 Could you please help me out on this — Pavan kumar, Nov 29 '19 at 09:53
at this point, I need you to explain more specifically why you cannot adapt what I've already done. Does this work for the clustering? Then accept the answer, or say why it does not. If you need more help to take the data from this format and re-arrange then start another question for that. — Chris Umphlett, Nov 29 '19 at 13:21
The approach does not work for clustering because its calculating only the similarity between the terms and I am trying to find out how many clusters between the terms. It is fine to use any approach if it cannot be done with Stringdist. As you suggest I will try to open a new question by — Pavan kumar, Dec 03 '19 at 09:34

How to cluster the similar texts in R

1 Answers1