I'm working on a phishing email filter project. As a first step, to guess whether an email is phishing, without using external APIs, I want to compare the visible text of each link with its underlying URL.
e.g.:
<a href="http://faceb00k.com">Facebook</a>
<a href="http://facedook.com">Facebook</a>
are high indicators of phishing.
Initially I was aware only of the Levenshtein distance, which I thought was a good measure, but then I realized that, after normalization, it is not a good indicator for this kind of task, because it is hardly ever higher than 0.5.
By normalization I mean:
normalized = levenshtein / MAX(a.length, b.length)
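With the example above this stays very low: the Levenshtein distance between "facebook" and "facedook" is 1 (a single substitution), so normalized = 1 / 8 = 0.125.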
The other algorithms that seem to work better are the cosine distance and the Jaro-Winkler distance.
In the above case, after lowercasing and trimming both strings and removing the protocol, the www prefix, and the top-level domain, as shown in the code below:
public interface RegEx {
    // Leading "http://" or "https://"
    String PROTOCOL = "^http(s)?://";
    // Leading "www." prefix
    String WWW_PREFIX = "www\\.";
    // Trailing dot plus top-level domain (letters, digits, hyphens)
    String TOP_LEVEL_DOMAIN = "\\.[A-Za-z0-9\\-]*$";
}
import org.apache.commons.text.similarity.CosineDistance;
import org.apache.commons.text.similarity.JaccardDistance;
import org.apache.commons.text.similarity.JaroWinklerDistance;
import org.apache.commons.text.similarity.LevenshteinDistance;

import java.util.regex.Pattern;

public class Test implements RegEx {

    public static void main(String[] args) {
        String text = "Facebook";
        String url = "https://www.facedook.com";

        System.out.println("Text: " + text);
        System.out.println("URL: " + url + "\n");

        // Compile the patterns declared in RegEx
        Pattern protocolPattern = Pattern.compile(PROTOCOL);
        Pattern prefixPattern = Pattern.compile(WWW_PREFIX);
        Pattern topLevelDomainPattern = Pattern.compile(TOP_LEVEL_DOMAIN);

        // Remove protocol
        text = protocolPattern.matcher(text).replaceAll("");
        url = protocolPattern.matcher(url).replaceAll("");

        // Remove www prefix
        text = prefixPattern.matcher(text).replaceAll("");
        url = prefixPattern.matcher(url).replaceAll("");

        // Remove top-level domain
        text = topLevelDomainPattern.matcher(text).replaceAll("");
        url = topLevelDomainPattern.matcher(url).replaceAll("");

        text = text.toLowerCase().trim();
        url = url.toLowerCase().trim();

        System.out.println("Text: " + text);
        System.out.println("URL: " + url + "\n");

        double levenshteinDistance = new LevenshteinDistance().apply(text, url);
        double normalizedLevenshteinDistance = levenshteinDistance / (double) Math.max(text.length(), url.length());
        System.out.println("Normalized Levenshtein Distance: " + normalizedLevenshteinDistance);

        double cosineDistance = new CosineDistance().apply(text, url);
        System.out.println("Cosine Distance: " + cosineDistance);

        double jaccardDistance = new JaccardDistance().apply(text, url);
        System.out.println("Jaccard Distance: " + jaccardDistance);

        double jaroWinklerDistance = new JaroWinklerDistance().apply(text, url);
        System.out.println("JaroWinkler Distance: " + jaroWinklerDistance);
    }
}
These are the distances I got in the console:
Text: Facebook
URL: https://www.facedook.com
Text: facebook
URL: facedook
Normalized Levenshtein Distance: 0.125
Cosine Distance: 1.0
Jaccard Distance: 0.25
JaroWinkler Distance: 0.95
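(As a sanity check on the last value: "facebook" and "facedook" share m = 7 matching characters with t = 0 transpositions, so Jaro = (7/8 + 7/8 + 7/7) / 3 ≈ 0.9167, and the Winkler bonus for the 4-character common prefix "face" gives 0.9167 + 4 × 0.1 × (1 − 0.9167) = 0.95. Note that this is really the Jaro-Winkler similarity, not a distance: the Commons Text version I'm using returns the similarity from JaroWinklerDistance.)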
So we can see that the cosine and Jaro-Winkler distances seem to capture the right signal for detecting phishing links.
Are they good for this purpose, or are there other distance functions better suited to the task? To explain better what I'm looking for: is there a distance function between strings that gives a higher value/distance when a character is replaced by one that looks similar to the human eye?
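To make the goal concrete, here is a rough sketch of the kind of function I have in mind: a Levenshtein variant whose substitution cost is discounted when the two characters are visually confusable. The pair table and the 0.2 cost are made-up illustrative values, not a standard list (a real table could be derived from Unicode's confusables data):

import java.util.Set;

public class HomoglyphDistance {

    // Illustrative confusable pairs (both directions); not a standard list.
    private static final Set<String> CONFUSABLE = Set.of(
            "o0", "0o", "l1", "1l", "il", "li", "bd", "db", "s5", "5s");

    // Substitution is free for equal characters, cheap for confusable
    // pairs, and full price otherwise.
    private static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        if (CONFUSABLE.contains("" + a + b)) return 0.2;
        return 1.0;
    }

    // Standard dynamic-programming Levenshtein with the weighted
    // substitution cost above; insertions and deletions still cost 1.
    public static double weightedLevenshtein(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double substitution = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(substitution, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(weightedLevenshtein("facebook", "faceb00k")); // 0.4 (two confusable swaps)
        System.out.println(weightedLevenshtein("facebook", "facedook")); // 0.2 (one confusable swap)
        System.out.println(weightedLevenshtein("facebook", "facezook")); // 1.0 (ordinary substitution)
    }
}

A link would then be flagged when the visible text and the URL are not equal but this weighted distance is very small, which is exactly the look-alike case above. Is there an established function or library that already works this way?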