I'm working on a phishing email filter project. As a first step, to guess whether an email is phishing, without using external APIs, I want to compare the visible text of each link with its underlying URL.
e.g.:
<a href="http://faceb00k.com">Facebook</a>
<a href="http://facedook.com">Facebook</a>
are high indicators of phishing.
Initially I was aware only of the Levenshtein distance, which I thought was a good measure, but then I realized that, after normalization, it is not a good indicator for this kind of task, because it is hardly ever higher than 0.5.
By normalization I mean:
normalized = levenshtein / MAX(a.length, b.length)
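With the example above this stays very low: the Levenshtein distance between "facebook" and "facedook" is 1 (a single substitution), so normalized = 1 / 8 = 0.125.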
The other algorithms that seem to work better are the cosine distance and the Jaro-Winkler distance.
In the above case, after lowercasing and trimming both strings and removing the protocol, the www prefix, and the top-level domain, as shown in the code below:
public interface RegEx {
    // Leading "http://" or "https://"
    String PROTOCOL = "^http(s)?://";
    // Leading "www." prefix
    String WWW_PREFIX = "www\\.";
    // Trailing dot plus top-level domain (letters, digits, hyphens)
    String TOP_LEVEL_DOMAIN = "\\.[A-Za-z0-9\\-]*$";
}
import org.apache.commons.text.similarity.CosineDistance;
import org.apache.commons.text.similarity.JaccardDistance;
import org.apache.commons.text.similarity.JaroWinklerDistance;
import org.apache.commons.text.similarity.LevenshteinDistance;

import java.util.regex.Pattern;

public class Test implements RegEx {

    public static void main(String[] args) {
        String text = "Facebook";
        String url = "https://www.facedook.com";

        System.out.println("Text: " + text);
        System.out.println("URL: " + url + "\n");

        // Compile the patterns declared in RegEx
        Pattern protocolPattern = Pattern.compile(PROTOCOL);
        Pattern prefixPattern = Pattern.compile(WWW_PREFIX);
        Pattern topLevelDomainPattern = Pattern.compile(TOP_LEVEL_DOMAIN);

        // Remove protocol
        text = protocolPattern.matcher(text).replaceAll("");
        url = protocolPattern.matcher(url).replaceAll("");

        // Remove www prefix
        text = prefixPattern.matcher(text).replaceAll("");
        url = prefixPattern.matcher(url).replaceAll("");

        // Remove top-level domain
        text = topLevelDomainPattern.matcher(text).replaceAll("");
        url = topLevelDomainPattern.matcher(url).replaceAll("");

        text = text.toLowerCase().trim();
        url = url.toLowerCase().trim();

        System.out.println("Text: " + text);
        System.out.println("URL: " + url + "\n");

        double levenshteinDistance = new LevenshteinDistance().apply(text, url);
        double normalizedLevenshteinDistance = levenshteinDistance / (double) Math.max(text.length(), url.length());
        System.out.println("Normalized Levenshtein Distance: " + normalizedLevenshteinDistance);

        double cosineDistance = new CosineDistance().apply(text, url);
        System.out.println("Cosine Distance: " + cosineDistance);

        double jaccardDistance = new JaccardDistance().apply(text, url);
        System.out.println("Jaccard Distance: " + jaccardDistance);

        double jaroWinklerDistance = new JaroWinklerDistance().apply(text, url);
        System.out.println("JaroWinkler Distance: " + jaroWinklerDistance);
    }
}
These are the distances I got in the console:
Text: Facebook
URL: https://www.facedook.com
Text: facebook
URL: facedook
Normalized Levenshtein Distance: 0.125
Cosine Distance: 1.0
Jaccard Distance: 0.25
JaroWinkler Distance: 0.95
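(As a sanity check on the last value: "facebook" and "facedook" share m = 7 matching characters with t = 0 transpositions, so Jaro = (7/8 + 7/8 + 7/7) / 3 ≈ 0.9167, and the Winkler bonus for the 4-character common prefix "face" gives 0.9167 + 4 × 0.1 × (1 − 0.9167) = 0.95. Note that this is really the Jaro-Winkler similarity, not a distance: the Commons Text version I'm using returns the similarity from JaroWinklerDistance.)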
So we can see that the cosine and Jaro-Winkler distances seem to capture the right signal for detecting phishing links.
Are they good for this purpose, or are there other distance functions better suited to the task? To explain better what I'm looking for: is there a distance function between strings that gives a higher value/distance when a character is replaced by one that looks similar to the human eye?
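To make the goal concrete, here is a rough sketch of the kind of function I have in mind: a Levenshtein variant whose substitution cost is discounted when the two characters are visually confusable. The pair table and the 0.2 cost are made-up illustrative values, not a standard list (a real table could be derived from Unicode's confusables data):

import java.util.Set;

public class HomoglyphDistance {

    // Illustrative confusable pairs (both directions); not a standard list.
    private static final Set<String> CONFUSABLE = Set.of(
            "o0", "0o", "l1", "1l", "il", "li", "bd", "db", "s5", "5s");

    // Substitution is free for equal characters, cheap for confusable
    // pairs, and full price otherwise.
    private static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        if (CONFUSABLE.contains("" + a + b)) return 0.2;
        return 1.0;
    }

    // Standard dynamic-programming Levenshtein with the weighted
    // substitution cost above; insertions and deletions still cost 1.
    public static double weightedLevenshtein(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double substitution = d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(substitution, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(weightedLevenshtein("facebook", "faceb00k")); // 0.4 (two confusable swaps)
        System.out.println(weightedLevenshtein("facebook", "facedook")); // 0.2 (one confusable swap)
        System.out.println(weightedLevenshtein("facebook", "facezook")); // 1.0 (ordinary substitution)
    }
}

A link would then be flagged when the visible text and the URL are not equal but this weighted distance is very small, which is exactly the look-alike case above. Is there an established function or library that already works this way?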