I have two data frames with text data about users:

x <- data.frame("Address_line1" = c("123 Street","21 Hill drive"), 
                "City" = c("Chicago","London"), "Phone" = c("123","219"))
y <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), 
                "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842"))
> x
  Address_line1     City Phone
1    123 Street  Chicago   123
2 21 Hill drive   London   219


> y
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842

For each row of the x data frame, I want to iterate over all the rows of y, compare the corresponding columns (address to address, city to city, etc.), and obtain the summed string distance for each pair of rows.

So for the first row of x, I want an output like:

[16 20 20]

Where 16 is

stringdist("123 Street","461 road", method = "lv")+
stringdist("Chicago","Dallas", method = "lv")+
stringdist("123","235", method = "lv") 

20 is the corresponding sum for the second row of y, and 20 for the third.

Similarly, I want a list containing nrow(y) elements for each row of x.

Rahul

1 Answer

We can use a nested for loop:

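library(stringdist)

out <- c()
for (i in seq_len(nrow(x))) {
  for (j in seq_len(nrow(y))) {
    # compare row i of x with row j of y, column by column
    x1 <- x[i, ]; y1 <- y[j, ]
    # sum the per-column Levenshtein distances and append to the result
    out <- c(out, sum(unlist(Map(stringdist, x1, y1,
                                 MoreArgs = list(method = 'lv')))))
  }
}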

out
#[1] 16 20 20 19 20 21

The expected output format is not entirely clear. We can also use tidyverse methods:

library(dplyr)
library(tidyr)
library(purrr)
library(stringdist)
library(stringr)
crossing(x, y, .name_repair = 'unique') %>%
   rename_all(~ str_remove(., "\\.{2,}")) %>% 
   split.default(str_remove(names(.), "\\d+$")) %>%
   map(~ pmap(.x,  ~ stringdist(..1, ..2, method = 'lv'))) %>% 
   transpose %>% 
   map_dbl(~ flatten_dbl(.x) %>% 
            sum)
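#[1] 16 20 20 19 21 20

Note that the last two values are swapped relative to the loop output. crossing() de-duplicates and sorts its inputs, so the rows of y are visited in sorted order ("543 Highway" before "PO Box 123") rather than in their original order.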
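
If what is wanted is a list with one vector of nrow(y) sums for each row of x, here is a minimal sketch of one way to get it. It swaps in base R's adist() (from utils, which computes Levenshtein distance with its default unit costs) for stringdist(), so no extra package is needed. The names per_column, total and result are only illustrative, and it assumes both data frames have the same columns in the same order.

```r
# Same example data as in the question; stringsAsFactors = FALSE keeps text as character
x <- data.frame(Address_line1 = c("123 Street", "21 Hill drive"),
                City = c("Chicago", "London"),
                Phone = c("123", "219"),
                stringsAsFactors = FALSE)
y <- data.frame(Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
                City = c("Dallas", "Paris", "New York"),
                Phone = c("235", "542", "842"),
                stringsAsFactors = FALSE)

# One nrow(x) by nrow(y) distance matrix per column pair (matched by position)
per_column <- Map(function(a, b) adist(a, b), x, y)

# Add the per-column matrices element-wise: total[i, j] is the summed
# distance between row i of x and row j of y
total <- Reduce(`+`, per_column)
dimnames(total) <- NULL

# One numeric vector of length nrow(y) for each row of x
result <- lapply(seq_len(nrow(total)), function(i) total[i, ])
result
```

For the example data, result[[1]] should be the 16 20 20 from the question, with the rows of y kept in their original order. Working on whole columns at once avoids growing a vector inside a double loop, which matters more as the data frames get larger.
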
akrun