I have a data frame containing user data:
library(stringdist)
x <- data.frame("Address_line1" = c("461 road", "PO Box 123", "543 Highway"),
                "City" = c("Dallas", "Paris", "New York"),
                "Phone" = c("235", "542", "842"))
x
  Address_line1     City Phone
1      461 road   Dallas   235
2    PO Box 123    Paris   542
3   543 Highway New York   842
I also have a named character vector with the same fields, in the same order as the columns of the data frame:
y = c("443 road","New york","842")
names(y) = colnames(x)
y
Address_line1       City Phone
   "443 road" "New york" "842"
I want to go through each row of the data frame, compute the stringdist() between each field of x and the corresponding field of y, and sum these distances to get a total score for each row.
For example, the score for first row will be:
row_1 = stringdist("461 road", "443 road", method = "lv") +
  stringdist("Dallas", "New york", method = "lv") +
  stringdist("235", "842", method = "lv")
row_1
[1] 13
Similarly, I want a score for all rows of the dataframe. This is the code I have written using for loops:
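list_dist <- numeric(nrow(x))    # preallocate one score per row
for (i in seq_len(nrow(x))) {
  row_i <- x[i, ]
  total <- 0                     # named 'total' to avoid masking base::sum()
  for (j in seq_along(y)) {
    # distance between field j of y and field j of the current row
    total <- total + stringdist(y[[j]], row_i[[j]], method = "lv")
  }
  list_dist[i] <- total
}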
list_dist
[1] 13 18 8
This gives the desired output, but the computation time is the problem: my real table has ~100k rows and 10 columns, and the code takes close to 30 minutes to run. Is there a more efficient way to do this?
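For reference, here is a minimal sketch of a vectorized alternative. It assumes the stringdist package is installed and that the columns of x line up with the elements of y, as above. stringdist() is vectorized over its arguments, so each column can be compared with the matching element of y in a single call, and the per-column distances can then be added element-wise. The helper name row_scores is hypothetical, and the speed-up on the 100k-row table has not been measured here.

```r
library(stringdist)

x <- data.frame(
  Address_line1 = c("461 road", "PO Box 123", "543 Highway"),
  City          = c("Dallas", "Paris", "New York"),
  Phone         = c("235", "542", "842"),
  stringsAsFactors = FALSE
)
y <- c(Address_line1 = "443 road", City = "New york", Phone = "842")

# Hypothetical helper: one vectorized stringdist() call per column,
# then an element-wise sum of the resulting distance vectors.
row_scores <- function(df, ref, method = "lv") {
  per_column <- Map(
    function(col, target) stringdist(as.character(col), target, method = method),
    df, ref
  )
  Reduce(`+`, per_column)
}

scores <- row_scores(x, y)
scores   # same values as the loop result above: 13 18 8
```

This replaces nrow(x) * length(y) scalar calls with length(y) vectorized calls, which is where most of the saving would be expected to come from.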