How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum stringdistance criterion?

Question

I am trying to make a way of presenting human-input words in a way that makes their groupings more easily recognisable as referring to the same thing. Essentially a spellchecker. I have gotten as far as making a large matrix (the actual one is 250 * 250 ish). The code for this matrix is identical to the reproducible example given below. (I have populated this with a random word generator, the actual values make much more sense but are confidential)

strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","eyes","hollow","quince","pack","twig","knot")

matrix <- stringdistmatrix(strings, strings, useNames = TRUE)

Now I want to create a new table with two variables, the first column must contain pairs of elements of 'strings' that satisfy the condition that their string-distance was lower than some number lets say for this example (stringdist<7, nonzero), the second column must contain the stringdist. Also the table should not show the reflection of the results present in the matrix e.g. (oval, curl: 3), (curl, oval: 3).

I've got a feeling that this will require an apply function of some sort but I haven't a clue.

Cheers.

rjen · Accepted Answer · 2021-01-18T20:58:29.417

1

The following solution based in the tidyverse should do the trick.

Please note that the last line is in there to make for an easy view of the result. I would not think it to be necessary for your purposes. If you do want to keep it, I would advise to incorporate it in the initial making of 'pair'.

library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)

matrix %>%
  as_tibble() %>%
  mutate(X = colnames(.), .before = 1) %>%
  pivot_longer(-X) %>%
  filter(value %in% 1:7) %>%
  transmute(pair = map2(X, name, ~ sort(c(.x, .y))),
            stringDist = value) %>%
  distinct(pair, stringDist) %>%
  mutate(pair = map_chr(pair, ~ str_c(., collapse = '_')))

# A tibble: 451 x 2
#   pair                   stringDist
#   <chr>                       <dbl>
# 1 domineering_sweltering          6
# 2 curl_root                       4
# 3 curl_gratis                     6
# 4 curl_secretary                  7
# 5 cumbersome_curl                 7
# 6 curl_oval                       3
# 7 curl_mighty                     6
# 8 curl_thaw                       4
# 9 curl_troubled                   6
# 10 curl_furniture                 7

edited Jan 18 '21 at 20:58

answered Jan 18 '21 at 20:43

rjen

1,938
1
7
19

Thanks, that solution is great. Ive done a bit more research and found people using the ```melt``` function in ```reshape2```. This can get very close with only two lines of code e.g. : ```X <- melt(X) ; X <- X[X$value <= 4 & X$value != 0,]``` The problem is that this still includes the reflection of the string-pairs, do you have any ideas how to fix that in this method? – Jose_harkhan Jan 19 '21 at 09:29
1

You are very welcome @Jose_harkhan. I'm glad that it solves your problem. I'm not familiar with the use of ```melt()```. For what it is worth: A lot of the work done by the solution I posted is done to remove the reflections. – rjen Jan 19 '21 at 10:31
so I can see! if you could shed any light on how the removal of the reflections works that would be much appreciated. Also I wonder if there is anything that can be done to the matrix that makes it triangular, or fills in one triangle with zeros prior to dyplr (etc) manipulation. – Jose_harkhan Jan 19 '21 at 10:44

score 0 · Answer 2 · answered Jan 19 '21 at 11:00

0

This also does the job

matrix[lower.tri(matrix)] <- 0
matrix_melt <- melt(matrix)
matrix_melt %>%
    filter(value %in% 1:7)

answered Jan 19 '21 at 11:00

Jose_harkhan

57
7

How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum stringdistance criterion?

2 Answers2