
I am doing some fuzzy text matching to match school names. Here is an example of my data, which is two columns in a tibble:

data <- tibble(school1 = c("abilene christian", "abilene christian", "abilene christian", "abilene christian"),
               school2 = c("a t still university of health sciences", "abilene christian university", "abraham baldwin agricultural college", "academy for five element acupuncture"))
data
# A tibble: 4 x 2
school1           school2                                
  <chr>             <chr>                                  
1 abilene christian a t still university of health sciences
2 abilene christian abilene christian university           
3 abilene christian abraham baldwin agricultural college   
4 abilene christian academy for five element acupuncture 

I would like to run stringdist with every available method and get back a table like this, which keeps my original text and adds one column per method holding the returned value:

# A tibble: 4 x 12
  school1           school2       osa    lv    dl hamming   lcs qgram cosine jaccard    jw soundex
  <chr>             <chr>       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>   <dbl>
1 abilene christian a t still …  29.0  29.0  29.0     Inf  36.0  24.0 0.189    0.353 0.442    1.00
2 abilene christian abilene ch…  11.0  11.0  11.0     Inf  11.0  11.0 0.0456   0.200 0.131    0   
3 abilene christian abraham ba…  28.0  28.0  28.0     Inf  35.0  25.0 0.274    0.389 0.431    1.00
4 abilene christian academy fo…  28.0  28.0  28.0     Inf  37.0  29.0 0.333    0.550 0.445    1.00

I can get this to work with a for loop:

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
for (i in method_list) {
  data[, i] <- stringdist(data$school1, data$school2, method = i)
}

What I would like to do is convert this into the more readable dplyr syntax, but I can't get the loop to work with mutate. Here is what I have:

for (i in method_list) {
  ft_result <- data %>%
    mutate(i = stringdist(school1, school2, method = i))
}

Running this adds a single column called "i" to my original data, with a value of 1 in every row.

Question 1: Is a for-loop the best way to accomplish this? I looked at purrr to see if I could use something like map or invoke, but I don't think any of those functions do what I want.

Question 2: If a for-loop is the way to go, how can I make it work with mutate? I tried using mutate_at, but that didn't work either.

Jenna Allen

1 Answer


This seems like a great place to use purrr::map_dfc.

The general idea is to map over the methods, passing each one to stringdist and wrapping the result in a data frame; map_dfc then binds the results together column-wise. purrr::set_names also comes in handy for naming the new columns.


library(tidyverse)
library(stringdist)

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")

tb <- starwars[c("name", "homeworld")]

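method_list %>%
  # one single-column tibble per method, bound together column-wise
  # (tibble() replaces the deprecated data_frame())
  map_dfc(function(str_method) {
    tibble(stringdist(tb$name, tb$homeworld, method = str_method))
  }) %>%
  # name each distance column after its method
  set_names(method_list) %>%
  # put the original columns in front of the distances
  bind_cols(tb, .)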
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 87 x 12
#>                  name homeworld   osa    lv    dl hamming   lcs qgram
#>                 <chr>     <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
#>  1     Luke Skywalker  Tatooine    13    13    13     Inf    18    18
#>  2              C-3PO  Tatooine     8     8     8     Inf    13    13
#>  3              R2-D2     Naboo     5     5     5       5    10    10
#>  4        Darth Vader  Tatooine     8     8     8     Inf    13    13
#>  5        Leia Organa  Alderaan     8     8     8     Inf    11     9
#>  6          Owen Lars  Tatooine     9     9     9     Inf    15    11
#>  7 Beru Whitesun lars  Tatooine    16    16    16     Inf    22    16
#>  8              R5-D4  Tatooine     8     8     8     Inf    13    13
#>  9  Biggs Darklighter  Tatooine    14    14    14     Inf    19    17
#> 10     Obi-Wan Kenobi   Stewjon    13    13    13     Inf    17    15
#> # ... with 77 more rows, and 4 more variables: cosine <dbl>,
#> #   jaccard <dbl>, jw <dbl>, soundex <dbl>
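
On Question 2: if you do want to keep the for loop, here is a minimal sketch of one way to make it work with mutate, using the data from the question. In mutate(i = ...) the left-hand side is taken literally, so the new column is named "i" rather than the method, and because each iteration starts again from data, only the last result is kept. The sketch assumes a dplyr version with tidy evaluation support (`!!` and `:=`); ft_result is just the name used in the question.

```r
library(dplyr)
library(tibble)
library(stringdist)

# Example data copied from the question.
data <- tibble(
  school1 = rep("abilene christian", 4),
  school2 = c("a t still university of health sciences",
              "abilene christian university",
              "abraham baldwin agricultural college",
              "academy for five element acupuncture")
)

method_list <- c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")

# Start from the original data and add to the same object on every pass,
# so that earlier columns are not thrown away.
ft_result <- data
for (i in method_list) {
  ft_result <- ft_result %>%
    # `!!i :=` uses the value of i (e.g. "osa") as the column name.
    # On the right-hand side, i still refers to the loop variable,
    # because no column in the data is called "i".
    mutate(!!i := stringdist(school1, school2, method = i))
}

ft_result
```

This should give the two original columns plus one column per method. Recent versions of dplyr also accept the glue-style form mutate("{i}" := ...) for the same purpose.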
zlipp