Nearest string match and their rowId

Question

I am trying to compare col_1 in the df_1 data frame with col_2 in the df_2 data frame to get the top 3 nearest matches (the lowest score represents the nearest match) together with their respective row IDs. Is there also a flexible way to change the number of nearest matches? In my case I have used the top 3, but I would like to be able to change it to the top 5, top 10, and so on.

col_1 = c("My name is john","The best ever Puma wishlist",  "i have been mailing my issue daily",
          "Its perfect for a day at gym")


col_2 =    c("My name is jon","My Name is jhn", "My Nam is mark", "Mu Name is John",
          "John is my name", "Its perfect for a day at gym&outside",  "Its perfect for a outside",
          "Its perfect day at gym", "Its perfect for a day at gm", "My name is john" )

row_id = c(1,2,3,4,5,6,7,8,9,10)

df_1 = data.frame(col_1)
df_2 = data.frame(col_2,row_id)

The final output data frame should be:

col_1  = c("My name is john","The best ever Puma wishlist","i have been mailing my issue daily","Its perfect for a day at gym")
nearest_1 = c("My name is john","Its perfect for a outside","Its perfect for a day at gym&outside","Its perfect for a day at gm")
nearest_1_row_id = c(10,7,6,9)
nearest_2 = c("My name is jon","Its perfect for a day at gym&outside","John is my name","Its perfect for a day at gym&outside")
nearest_2_row_id = c(1,6,5,6)
nearest_3 = c("My Name is jhn","Its perfect day at gym","My name is john","Its perfect day at gym")
nearest_3_row_id = c(2,8,4,8)

df_1_out = data.frame(col_1,nearest_1,nearest_1_row_id,nearest_2,nearest_2_row_id,nearest_3,nearest_3_row_id)

I have tried with

library(stringdist)
df_1_out = df_1
df_1_out$nearest_1 = stringdist("My name is john","My name is jon",  method = 'jw')

Likewise, I need to compare each and every row. Is there an alternative method to achieve the required output?

score 0 · Answer 1 · answered Sep 29 '21 at 08:02

Here is a go at things:

library(stringdist)
library(data.table)
ans <- as.data.table(
  stringdist::stringdistmatrix(df_1$col_1, df_2$col_2, useNames = TRUE, method = "jw"), 
  keep.rownames = TRUE)
ans.melt <- melt(ans, id.vars = "rn", value.name = "stringdist")
#set order of value
setorder(ans.melt, rn, stringdist)
#select top 3, join rownumbers
ans.melt[, .SD[1:3], by = .(rn)][setDT(df_2), row := i.row_id, on = .(variable = col_2)][]
#                                    rn                             variable stringdist row
# 1:       Its perfect for a day at gym          Its perfect for a day at gm 0.01190476   9
# 2:       Its perfect for a day at gym Its perfect for a day at gym&outside 0.07407407   6
# 3:       Its perfect for a day at gym               Its perfect day at gym 0.10930736   8
# 4:                    My name is john                      My name is john 0.00000000  10
# 5:                    My name is john                       My name is jon 0.02222222   1
# 6:                    My name is john                       My Name is jhn 0.06825397   2
# 7:        The best ever Puma wishlist            Its perfect for a outside 0.45001764   7
# 8:        The best ever Puma wishlist Its perfect for a day at gym&outside 0.48703704   6
# 9:        The best ever Puma wishlist               Its perfect day at gym 0.48947811   8
#10: i have been mailing my issue daily Its perfect for a day at gym&outside 0.41040305   6
#11: i have been mailing my issue daily                      John is my name 0.42124183   5
#12: i have been mailing my issue daily                      My name is john 0.45074272  10
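
On the question of changing the number of matches: the `1:3` above is the only place where the 3 is hard-coded. Below is a minimal sketch of the same idea with the count pulled out into a variable. The name `top_n` is my own, and I build the long table by position instead of melting and joining, which is an assumption about what is convenient rather than a requirement.

```r
library(stringdist)
library(data.table)

# Same example data as in the question
df_1 <- data.frame(col_1 = c("My name is john", "The best ever Puma wishlist",
                             "i have been mailing my issue daily",
                             "Its perfect for a day at gym"))
df_2 <- data.frame(col_2 = c("My name is jon", "My Name is jhn", "My Nam is mark",
                             "Mu Name is John", "John is my name",
                             "Its perfect for a day at gym&outside",
                             "Its perfect for a outside", "Its perfect day at gym",
                             "Its perfect for a day at gm", "My name is john"),
                   row_id = 1:10)

top_n <- 3  # hypothetical parameter: set to 5, 10, ... as needed

# Distance matrix: one row per col_1 string, one column per col_2 string
d <- stringdist::stringdistmatrix(df_1$col_1, df_2$col_2, method = "jw")

# Long format built by position (column-major, matching as.vector(d))
long <- data.table(
  col_1      = df_1$col_1[as.vector(row(d))],
  nearest    = df_2$col_2[as.vector(col(d))],
  row_id     = df_2$row_id[as.vector(col(d))],
  stringdist = as.vector(d)
)

# Smallest distance first within each col_1 string, then keep the first top_n
setorder(long, col_1, stringdist)
res <- long[, head(.SD, top_n), by = col_1]
res
```

`head(.SD, top_n)` also behaves sensibly when a group has fewer than `top_n` candidates, whereas `.SD[1:top_n]` would pad with NA rows.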

Thanks, but in my actual data I have a huge number of rows in both data frames, and I am getting an error because of the size of the matrix. - san1, Sep 29 '21 at 09:00

Park · Answer 2 · 2021-09-29T08:35:10.073

data.frame(col_1,nearest_1,nearest_1_row_id,nearest_2,nearest_2_row_id,nearest_3,nearest_3_row_id)

                                                                col_1 nearest_1 nearest_1_row_id nearest_2
My name is john                                       My name is john        10               10         1
The best ever Puma wishlist               The best ever Puma wishlist         7                7         6
i have been mailing my issue daily i have been mailing my issue daily         6                6         5
Its perfect for a day at gym             Its perfect for a day at gym         9                9         6
                                   nearest_2_row_id nearest_3 nearest_3_row_id
My name is john                                   1         2                2
The best ever Puma wishlist                       6         8                8
i have been mailing my issue daily                5        10               10
Its perfect for a day at gym                      6         8                8

It's not that fancy, but it works:

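library(stringdist)
library(dplyr)    # for %>%
library(tibble)   # for rownames_to_column()

# Position in df_2 of the 1st, 2nd and 3rd closest string for each element of col_1
# (the position equals row_id here because row_id is 1:10)
nearest_1_row_id <- sapply(col_1, function(x) which(rank(stringdist(x, df_2$col_2, method = "jw")) == 1))
nearest_2_row_id <- sapply(col_1, function(x) which(rank(stringdist(x, df_2$col_2, method = "jw")) == 2))
nearest_3_row_id <- sapply(col_1, function(x) which(rank(stringdist(x, df_2$col_2, method = "jw")) == 3))

# sapply() names each result after its col_1 string; move those names into a column
x1 <- data.frame(nearest_1_row_id) %>% rownames_to_column("nearest_1")
x2 <- data.frame(nearest_2_row_id) %>% rownames_to_column("nearest_2")
x3 <- data.frame(nearest_3_row_id) %>% rownames_to_column("nearest_3")

cbind(col_1, x1, x2, x3)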

                               col_1                          nearest_1 nearest_1_row_id
1                    My name is john                    My name is john               10
2        The best ever Puma wishlist        The best ever Puma wishlist                7
3 i have been mailing my issue daily i have been mailing my issue daily                6
4       Its perfect for a day at gym       Its perfect for a day at gym                9
                           nearest_2 nearest_2_row_id                          nearest_3 nearest_3_row_id
1                    My name is john                1                    My name is john                2
2        The best ever Puma wishlist                6        The best ever Puma wishlist                8
3 i have been mailing my issue daily                5 i have been mailing my issue daily               10
4       Its perfect for a day at gym                6       Its perfect for a day at gym                8

I also made a function that returns the i-th nearest match:

nearest <- function(i) {
  # position in df_2 of the i-th closest string for each element of col_1
  sapply(col_1, function(x) which(rank(stringdist(x, df_2$col_2, method = "jw")) == i)) %>%
    stack %>%
    dplyr::rename("nearest_{i}_row_id" := values, "nearest_{i}" := ind)
}

Call it with i to get the i-th nearest match, for example:

nearest(1)
  nearest_1_row_id                          nearest_1
1               10                    My name is john
2                7        The best ever Puma wishlist
3                6 i have been mailing my issue daily
4                9       Its perfect for a day at gym

So, cbind(col_1, nearest(1), nearest(2), nearest(3)) will get the result you want.
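
Two caveats with the approach above: the nearest_i columns hold the col_1 string, not the matched col_2 string, and `rank()` averages tied ranks by default, so `== i` can match nothing when two candidates have the same distance. A rough sketch that sidesteps both by using `order()` and takes the number of matches as an argument is below. The helper name `nearest_n` and its arguments are made up for illustration. It handles one col_1 string at a time, so it never builds the full distance matrix, which may help with larger inputs, though I have not benchmarked it.

```r
library(stringdist)

# Same example data as in the question
df_1 <- data.frame(col_1 = c("My name is john", "The best ever Puma wishlist",
                             "i have been mailing my issue daily",
                             "Its perfect for a day at gym"))
df_2 <- data.frame(col_2 = c("My name is jon", "My Name is jhn", "My Nam is mark",
                             "Mu Name is John", "John is my name",
                             "Its perfect for a day at gym&outside",
                             "Its perfect for a outside", "Its perfect day at gym",
                             "Its perfect for a day at gm", "My name is john"),
                   row_id = 1:10)

# Hypothetical helper: for each string in `queries`, return the `n` closest
# strings in `candidates` and their `ids`, in wide format.
nearest_n <- function(queries, candidates, ids, n = 3, method = "jw") {
  rows <- lapply(queries, function(x) {
    d   <- stringdist(x, candidates, method = method)  # one distance per candidate
    top <- order(d)[seq_len(min(n, length(d)))]        # positions of the n smallest
    out <- list()
    for (k in seq_along(top)) {
      out[[paste0("nearest_", k)]]            <- candidates[top[k]]
      out[[paste0("nearest_", k, "_row_id")]] <- ids[top[k]]
    }
    as.data.frame(out, stringsAsFactors = FALSE)
  })
  data.frame(col_1 = queries, do.call(rbind, rows), stringsAsFactors = FALSE)
}

df_1_out <- nearest_n(df_1$col_1, df_2$col_2, df_2$row_id, n = 3)
df_1_out
```

Changing `n = 3` to `n = 5` adds the nearest_4 and nearest_5 column pairs without touching anything else.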
