I'm trying to clean up our customer database by identifying records that are similar enough to be considered the same customer (and thus given the same customer id). I've concatenated the relevant customer data into one column named customerdata. I found the R package stringdist, and I'm using the following code to calculate the distance between every pair of records:

library(stringdist)

# start each record off with its own id
output <- df$id

for (i in 1:(length(df$customerdata) - 1)) {
  for (j in (i + 1):length(df$customerdata)) {
    # cheap pre-filter: skip pairs whose lengths differ by 10 or more
    if (abs(df$customerdataLEN[i] - df$customerdataLEN[j]) < 10) {
      # expensive check: close enough means the same customer
      if (stringdist(df$customerdata[i], df$customerdata[j]) < 10) {
        output[j] <- df$id[i]
      }
    }
  }
}

df$newcustomerid <- output

So here, I first initialize a vector named output with the customer id data, then loop over all pairs of customers. I have a column called customerdataLEN that holds the length of each concatenated string. To reduce calculation time, I first check whether the two lengths differ by 10 or more; if so, I don't bother calculating the stringdist. Otherwise, if the distance between the two customers is < 10, I consider them the same customer and give the second record the first record's id.
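
For illustration, here is that threshold logic on a small, made-up data frame (hypothetical names and values; with stringdist's default OSA method, near-duplicates like rows 1 and 2 score well under the threshold of 10):

library(stringdist)

# hypothetical toy data: rows 1 and 2 are near-duplicates of each other
df <- data.frame(
  id = 1:3,
  customerdata = c("john smith 42 main st",
                   "jon smith 42 main st.",
                   "alice jones 9 elm rd"),
  stringsAsFactors = FALSE
)
df$customerdataLEN <- nchar(df$customerdata)

stringdist(df$customerdata[1], df$customerdata[2])  # 2: under the threshold, same customer
stringdist(df$customerdata[1], df$customerdata[3])  # well above the threshold: different customers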

I'm looking to speed up the process, however. At 2,000 rows this loop takes 2 minutes; at 7,400 rows it takes 32 minutes. I'm looking to run this on around 1,000,000 rows. Does anyone have an idea how to improve the speed of this loop?
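For what it's worth, the same package also provides a vectorized stringdistmatrix() that computes all pairwise distances in one call. A minimal sketch, with the caveat that the full distance matrix is far too big for 1,000,000 rows, so it would have to be applied per block of candidate records (e.g. records of similar length) rather than to the whole table:

library(stringdist)

# all pairwise distances at once; returns a lower-triangular "dist" object
d <- stringdistmatrix(df$customerdata)
m <- as.matrix(d)

# candidate duplicate pairs: any pair under the threshold
which(m < 10 & upper.tri(m), arr.ind = TRUE)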

  • since the distance between customer A and customer B is the same as the distance between customer B and customer A... you should not repeat that calculation. Likewise, skip the calculation when you are comparing customer A to customer A. – cory Mar 18 '21 at 13:53
  • Please post examples of your data and the result you are after (as the output from `dput()`). – rjen Mar 18 '21 at 21:09
