1

I've made an algorithm to determine scores of matching strings from 2 dataframes in R. It will search for each row in test_ech the matching rows which their score is above 0.75 in test_data (based on the matching of 3 columns from each data frame).

Well, my code works perfectly with small data frame but I'm dealing with dataframes of 12m rows and the process will take at least 5 days to be done. So I think that if I discard "for loops" It will work but I really don't know how to do it. (and if there's extra changes that I need to do to lighten the process)

Thanks.

#score function :

library(stringdist)

score <- function(i,j) 
{  
s_n<-stringsim(test_ech[j,3],test_data[i,5],method = "jw",p=0.15)
s_v<-stringsim(test_ech[j,5],test_data[i,4],method = "jw",p=0.15)
s_c<-stringsim(test_ech[j,4],test_data[i,3],method = "jw",p=0.15)

 return(s_n*0.6+s_v*0.25+s_c*0.15)
}

#initialize result data frame :

resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(),                nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())

#algo textmining :

system.time(for (j in 1:nrow(test_ech)) {

  for (i in 1:nrow(test_data)) {

    x<-score(i,j)

    if (x>0.75) {

ligne<-data.frame(nom_AS400=test_ech[j,3],
       ville_AS400=test_ech[j,5],
       cp_AS400=test_ech[j,4],
       nom_SIRENE=test_data[i,5],
       ville_SIRENE=test_data[i,4],
       cp_SIRENE=test_data[i,3],
       score=x)

      resultat<-rbind(resultat,ligne)      
    }  
  } 
})

test_ech : 65k rows and test_data : 12m rows

#test_ech (5 rows)
structure(list(societe_code = c("01", "01", "01", "01", "01"), 
    client_code = c("00048I", "00059Z", "00070Q", "00080W", "00131L"
    ), client_lib = c("CFA VAUBAN", "ALLRIM SA", "ATS CULLIGAN", 
    "AHSSEA", "ETS BRUNEAU P"), client_cp = c("25001", "25401", 
    "25480", "70002", "94700"), client_ville = c("BESANCON CEDEX", 
    "AUDINCOURT CEDEX", "ECOLE VALENTIN", "VESOUL CEDEX", "MAISONS ALFORT"
    )))

#test_data (5 rows)
structure(list(siren = c("005450093", "005450095", "005541552", 
"005580501", "005620117"), siret = c("00545009300033", "00545009300041", 
"00554155200039", "00558050100012", "00562011700019"), codePostalEtablissement = c("04800", 
"04802", "04260", "44600", "80100"), libelleCommuneEtablissement = c("GREOUX LES BAINS", 
"BAINS", "ALLOS", "SAINT NAZAIRE", "ABBEVILLE"), ref = c("PASSIONNEMENT GLAMOUR", 
"GLAMOUR", "LE SYMPA SNACK", "STEF", "DUBOIS")))

the expected output is a dataframe with the 3 reference columns from test_ech with the 3 matching columns from test_data and the score which should be >0.75

output link

Amine96
  • 65
  • 6
  • 1
    Hi Amine, can you give us a sample of your data base using the dput() function? – Luis Jun 04 '19 at 12:46
  • I'm afraid, you have to call `score` 12m·12m times, which is 1,4·10^14, which is a lot. No matter if you are using for loops or arrays… – ziggystar Jun 04 '19 at 12:50
  • You can run it in parallel, with a nice computer look into the future package – Bruno Jun 04 '19 at 12:53
  • @ziggystar I updated the post at the end...test_ech : 65k rows and test_data : 12m rows... So yes you have to call score 12m times for 65k rows in total – Amine96 Jun 04 '19 at 12:57
  • 1
    @Luis I've updated the post you can find the 2 data frames using the dput() function – Amine96 Jun 04 '19 at 13:06
  • Please double check your example data from `dput` it appears incomplete. And if you can make sure that some entries meet your score threshold. – emilliman5 Jun 04 '19 at 14:18
  • @Amine96, I'm unable to make your example work. After running do.call(cbind.data.frame, test_ech) to turn your dput object from a list to a dataframe (same for test_data), I run score(1, 1) and it doesn't work. The function itself is calling for the fourth and fifth column of test_ech and test_data, but the dput objects only have 3 columns – Luis Jun 04 '19 at 14:43
  • @emilliman5 dput code updated (sorry for the delay) – Amine96 Jun 04 '19 at 14:53
  • @Luis dput code updated (sorry for the delay) – Amine96 Jun 04 '19 at 14:53

2 Answers2

0

I'm not sure if this completely solves your problem given the dimensions of your original data, but you can reduce your time substantially by doing it over one for loop instead of two. You can do this because the stringsim function accepts a single character object on one side and a vector on the other.

    score_2 <- function(j) 
{  
  s_n <- stringsim(test_ech[[j,3]], test_data[[5]], method = "jw", p = 0.15)
  s_v <- stringsim(test_ech[[j,5]], test_data[[4]], method = "jw", p = 0.15)
  s_c <- stringsim(test_ech[[j,4]], test_data[[3]], method = "jw", p = 0.15)

  return(s_n * 0.6 + s_v * 0.25 + s_c * 0.15)
}

    stringsim (test_ech[,3], test_data[,5])

    resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(),                nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())

    for (j in 1:nrow(test_ech)) {

      x <- score_2(j)

      x_75 = which(x > 0.75)

      if(length(x_75) > 0){
        for(i in x_75){

         ligne<-data.frame(nom_AS400=test_ech[[j,3]],
                           ville_AS400=test_ech[[j,5]],
                           cp_AS400=test_ech[[j,4]],
                           nom_SIRENE=test_data[[i,5]],
                           ville_SIRENE=test_data[[i,4]],
                           cp_SIRENE = test_data[[i,3]],                       
                           score = x[i])

      resultat<-rbind(resultat,ligne)

    }
   }
  }

Your function, repeating both your test objects 60 times:

  usuário   sistema decorrido 
     9.59      1.43     11.12 

This function, repeating both test objects 60 times:

  usuário   sistema decorrido 
     0.21      0.08      0.18 

Quite a bit faster :)

(note: there is stringdistmatrix which accepts vectors on both sides and returns a matrix, but sadly there is no stringsimmatrix. If you can map out the difference between stringdist and stringsim, running stringdistmatrix and adjusting it would probably be even faster).

Luis
  • 629
  • 4
  • 9
  • Thanks for your reply, I like the use of only one loop instead of two. Yet, there's an issue with the score function, it will calculate the score matching of one string from test_ech[,3] with a vector containing all strings in test_data[,5], and in the end it will return the MEAN value of the score when matching all strings in test_data and not each score for each matching. – Amine96 Jun 05 '19 at 08:35
  • I've updated your code in an answer post... Many thanks for your help – Amine96 Jun 05 '19 at 10:21
  • @Amine96 Ah, you're right, it should have been x[i] instead of x[x_75]. Anyway, at least your problem is solved. Do you mind accepting my answer? – Luis Jun 05 '19 at 13:51
  • No worries, yet you need to update your code with mine to mark it as an answer.(I also changed the score function with [[]] in test_data and test_ech) – Amine96 Jun 05 '19 at 14:57
0

Finally, I've solved the problem thanks to @Luis with using only one loop instead of two.

Code below :

    score_2 <- function(j) 
{  
  s_n <- stringsim(test_ech[[j,3]], test_data[[5]], method = "jw", p = 0.15)
  s_v <- stringsim(test_ech[[j,5]], test_data[[4]], method = "jw", p = 0.15)
  s_c <- stringsim(test_ech[[j,4]], test_data[[3]], method = "jw", p = 0.15)

  return(s_n * 0.6 + s_v * 0.25 + s_c * 0.15)
}

    stringsim (test_ech[,3], test_data[,5])

    resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(),                nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())

    for (j in 1:nrow(test_ech)) {

      x <- score_2(j)

      x_75 = which(x > 0.75)

      if(length(x_75) > 0){
        for(i in x_75){

         ligne<-data.frame(nom_AS400=test_ech[[j,3]],
                           ville_AS400=test_ech[[j,5]],
                           cp_AS400=test_ech[[j,4]],
                           nom_SIRENE=test_data[[i,5]],
                           ville_SIRENE=test_data[[i,4]],
                           cp_SIRENE = test_data[[i,3]],                       
                           score = x[i])

      resultat<-rbind(resultat,ligne)

    }
   }
  }
Amine96
  • 65
  • 6