
I have a table with thousands of rows.

Sample data:

user_id  ZIP      City    email
105      100051   Lond.   jsmith@hotmail.com
382      251574           jgjefferson@gmail.com
225      0100051  London  john.smith@hotmail.com

I need to compare every user against the others to find which ones are similar.

In the example given, users 105 and 225 are almost the same, so the expected result would be a new id column that matches the two of them, like this:

user_id  ZIP      City    email                    new_id
105      100051   Lond.   jsmith@hotmail.com       105
382      251574           jgjefferson@gmail.com    382
225      0100051  London  john.smith@hotmail.com   105

How would I compare every field with the others, and what approach should I use to compare them, such as clustering, for example?

  • SO isn't a homework site.. sorry. Show effort (post what you tried to solve your problem) and you will be rewarded ;-) – ZF007 Feb 02 '18 at 16:38
  • you can use the `agrep` function to find all the matches: eg `sapply(dat$email,agrep,dat$email,value=T)` will give you the matches for the email – Onyambu Feb 02 '18 at 16:53
  • Sorry ZF007, it was my first question here. What I tried was the MATCH() function in MySQL, and also comparing fields for equality after removing capital letters, for example. But it didn't work as I expected. I am quite new to R, but I was thinking of a distance matrix or something (clustering, for example). – ecp Feb 05 '18 at 07:43

1 Answer


Your emails:

library(stringdist)  # provides stringdistmatrix()

email <- c("jsmith@hotmail.com", "jgjefferson@gmail.com", "john.smith@hotmail.com")

Distance between emails:

dist <- stringdistmatrix(email, email, method = "jw")
dist[dist == 0] <- 1  # mask exact matches (including the diagonal) so which.min() skips self-comparisons

Minimum distance between emails:

cbind(email, email_near = email[apply(dist, 1, which.min)], dist = apply(dist, 1, FUN = min))

     email                    email_near               dist               
[1,] "jsmith@hotmail.com"     "john.smith@hotmail.com" "0.208754208754209"
[2,] "jgjefferson@gmail.com"  "jsmith@hotmail.com"     "0.281746031746032"
[3,] "john.smith@hotmail.com" "jsmith@hotmail.com"     "0.208754208754209"

After that, I suggest applying a threshold on dist to identify the closest emails and then computing the new_id.
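For instance, that thresholding step could be sketched like this. This is a minimal sketch: the 0.25 threshold is an illustrative assumption, and taking the smallest user_id among linked rows is one simple grouping choice, not a full transitive clustering.

```r
library(stringdist)

# Sample data from the question
user_id <- c(105, 382, 225)
email <- c("jsmith@hotmail.com", "jgjefferson@gmail.com", "john.smith@hotmail.com")

# Jaro-Winkler distance matrix between all pairs of emails
d <- stringdistmatrix(email, email, method = "jw")

# Treat pairs below the threshold as the same user
# (the diagonal is 0, so every row is always linked to itself)
linked <- d < 0.25

# new_id = smallest user_id among the rows linked to each row
new_id <- sapply(seq_along(user_id), function(i) min(user_id[linked[i, ]]))
data.frame(user_id, email, new_id)
```

For the three sample rows this yields new_id = 105, 382, 105, matching the expected output in the question. For larger data you would want a proper transitive grouping (for example, connected components over the linked pairs) instead of the minimum trick.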

Terru_theTerror
  • jsmith123@hotmail.com and jsmith123@someotherdomain.com would have large distance but would obviously be similar. Perhaps split on the `@` sign and compare parts independently. – thc Feb 02 '18 at 17:47
  • In the end my solution is not deterministic. You could have name1@web.com and name2@web.com coming from different users, or nameX@web.com and nameX@web2.com from the same user. – Terru_theTerror Feb 03 '18 at 08:32
  • So, for comparing all fields, I'd need to sum the distances of each variable? – ecp Feb 05 '18 at 07:41
  • I also need to compare more than 2 emails, like: jsmith123@hotmail.com, jsmith123@someotherdomain.com, johnsmith@hotmail.com, can I use the same method? – ecp Feb 05 '18 at 07:48
  • The code works with any number of emails. In "email_near" you can find the email closest to "email", and "dist" represents the string distance (between 0 [equal] and 1 [maximum distance]). You don't have to sum the distances; you just have to consider a match when the distance is very small (e.g. dist < 0.25). – Terru_theTerror Feb 05 '18 at 08:28
  • OK, thank you Terru. I have tried several libraries, such as "stringdist" and "reshape2", but I can't find the function. What library do you use for the functions stringdistmatrix and stringdist? – ecp Feb 05 '18 at 09:54
  • I reinstalled R and got the library "stringdist", but it seems not to work for huge files. I tried with 20.000 emails and it crashes. Is it because of my RAM? I have 4 GB. – ecp Feb 05 '18 at 10:12
  • Ok, 20.000 emails is too much. Have a look at this topic concerning "Efficient string similarity grouping"; you will also find an answer of mine there. https://stackoverflow.com/questions/48058104/efficient-string-similarity-grouping/48098679#48098679 – Terru_theTerror Feb 05 '18 at 11:20
  • Thank you very much Terru, I am going to work with this, greetings. – ecp Feb 06 '18 at 07:48