I have a list of email addresses in a SAS dataset and want to identify similar addresses in it. I am trying to apply the COMPGED function across all rows of the email variable. I need to sort the list by that distance so that similar email addresses end up as neighbours. Can anybody help with this, please?

  • What's your code at the moment? What's wrong with the result? – Sven R. Feb 07 '16 at 18:31
  • For this type of linkage, you can try the options here: the solution from @friedegg is a good one based on compged, and the reference to the-link-king.com is a good option as well. https://communities.sas.com/t5/SAS-Procedures/Name-matching/m-p/82780/highlight/true#M23757 – Reeza Feb 07 '16 at 20:43

1 Answer


Do a self join in proc sql, using the result of compged as the join condition:

Example:

proc sql ;
  create table similar_emails as
  select a.Email as EmailA, b.Email as EmailB
  from email_list a
       left join
       email_list b on compged(a.Email,b.Email) <= 200 
  order by a.Email ;
quit ;
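
A variant of the same query, as a rough sketch only: it pairs every row with every other row, keeps the compged score as a column, drops pairs where both sides are the same address, and sorts by score so the closest pairs come first. The table name `similar_emails_scored` and the column name `Score` are made up for illustration, and the cutoff of 200 is just carried over from the example above; adjust it to your data.

```sas
proc sql ;
  /* similar_emails_scored and Score are illustrative names, not from the question */
  create table similar_emails_scored as
  select a.Email as EmailA,
         b.Email as EmailB,
         compged(a.Email, b.Email) as Score   /* generalized edit distance for the pair */
  from email_list a,
       email_list b                           /* comma join = every row against every row */
  where a.Email ne b.Email                    /* skip an address matched to itself */
    and calculated Score <= 200               /* assumed cutoff, same as the example above */
  order by Score, EmailA ;                    /* most similar pairs first */
quit ;
```

Each pair shows up twice here (A/B and B/A), and on a large list the full cross join can be slow, since it evaluates compged for every combination of rows.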
Chris J
  • But I have only one email list. Suppose I have n email IDs. I have to compare the 1st email ID with the remaining (n-1) email IDs, the 2nd email ID with the remaining (n-1) IDs, and so on. – Arpan Mondal Feb 07 '16 at 20:07
  • Use a cross join instead of a left join, sort on the score, and add it to the select statement as well. – Reeza Feb 07 '16 at 20:39
  • My example is based on a single list of emails. If you wish to exclude an email from matching to itself, give each row an ID in a preceding datastep, and add `and a.ID ^= b.ID` to the join condition. – Chris J Feb 08 '16 at 08:35
  • Or you can just delete any observation where emailA=emailB – invoketheshell Feb 08 '16 at 16:42