Fuzzy compare with weighted fields (suggest similar instances)

Question

Today I came across a task and enjoyed solving it with clean code, so I decided it'd be cool to share it with the rest of the class. But hey, let's keep it in the format of a question.

The task:

Given an instance of type T (the source) and a collection of instances of type T (the possible suggestions), provide the suggestions that are similar to the source, ordered by similarity, and entirely exclude suggestions whose similarity is below a certain threshold.

Similarity is a fuzzy string comparison of multiple fields of the instance, with an importance weight for each field.

Example input:

Source instance:

{A = "Hello", B = "World", C = "and welcome!"}

Possible suggestions:

{A = "Hola", B = "World", C = "Welcome!"}
{A = "Bye", B = "world", C = "and fairwell"}
{A = "Hell", B = "World", C = "arrives..."}
{A = "Hello", B = "Earth", C = "and welcome!"}
{A = "Hi", B = "world", C = "welcome!"}

Importance of fields:

A: 30%
B: 50%
C: 20%

Example output:

[0] = {A = "Hell", B = "World", C = "arrives..."}
[1] = {A = "Hola", B = "World", C = "Welcome!"}
[2] = {A = "Hello", B = "Earth", C = "and welcome!"}
[3] = {A = "Hi", B = "world", C = "welcome!"}

Note that the possible suggestion Bye;world;and fairwell is not here at all, as it doesn't meet the minimum similarity threshold (let's say the threshold is at least 50% weighted similarity).

The first result is the most similar to the source even though its C field is not similar to the source at all, because we gave C a weight as low as 20%, and the two more heavily weighted fields are very similar (or an exact match) to the source.

Fuzzy comparison side-note

The algorithm used to compare string a with string b can be any of the known fuzzy comparison algorithms; that's not really the point here.

So how could one turn that list of possible suggestions into an actual list of ordered suggestions? (Oh lord, please help, etc)

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

For our case, let's use the awesome Levenshtein distance algorithm.

So assume we have a function with the following signature:

private static int CalcLevenshteinDistance(string a, string b)
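
The answer only assumes that such a function exists and does not show its body. For readers who want something to plug in, here is a minimal sketch of the classic dynamic-programming edit distance. The class name Levenshtein, the public visibility and the choice to treat null as an empty string are my assumptions, and the comparison is case-sensitive:

```csharp
using System;

public static class Levenshtein
{
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b. Only two rows of the table are kept.
    public static int CalcLevenshteinDistance(string a, string b)
    {
        a = a ?? string.Empty; // assumption: null behaves like ""
        b = b ?? string.Empty;

        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions turn the first i chars of a into ""
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }

            // The row just filled becomes the "previous" row of the next iteration.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }
}
```

For instance, Levenshtein.CalcLevenshteinDistance("Hello", "Hell") is 1 (one deletion), and "World" versus "world" is also 1 because the comparison is case-sensitive.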

And to actually get the similarity between a and b, rather than distance, we'll use:

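private static decimal CalcLevenshteinSimilarity(string a, string b)
{
    // Two empty strings are identical; returning early also avoids dividing by zero.
    int maxLength = Math.Max(a.Length, b.Length);
    if (maxLength == 0)
        return 1m;

    return 1 - ((decimal)CalcLevenshteinDistance(a, b) / maxLength);
}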

This returns exactly 1 if the strings are identical, 0 if they have nothing in common, and a value in between otherwise. For example, 0.89 means that a and b are 89% similar (not bad!).

To help us with the weighted fields, let's create a little helper class:

public class SuggestionField
{
    public string SourceData { get; set; }
    public string SuggestedData { get; set; }
    public decimal Importance { get; set; }
}

This represents all the information needed to compare a single field of a suggested T instance with the same field of the source T instance.

Now calculating the weighted similarity between a single suggestion and the source is fairly simple:

private static decimal RateSuggestion(IEnumerable<SuggestionField> fields)
{
    return fields.Sum(x =>
        x.Importance * CalcLevenshteinSimilarity(x.SourceData,
                                                 x.SuggestedData));
}
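
To make the weighted sum concrete, here is a small sketch that rates the suggestion Hell;World;arrives... against the source from the question. The per-field similarities are worked out by hand with a case-sensitive Levenshtein distance (distances of 1, 0 and 10), and the class name WeightedSumDemo exists only for this illustration:

```csharp
using System;
using System.Linq;

public static class WeightedSumDemo
{
    public static decimal Rate()
    {
        // One entry per field: its importance and its hand-computed similarity.
        var fields = new[]
        {
            new { Importance = 0.3m, Similarity = 1 - 1m / 5 },   // "Hello" vs "Hell": distance 1, longest length 5
            new { Importance = 0.5m, Similarity = 1m },           // "World" vs "World": exact match
            new { Importance = 0.2m, Similarity = 1 - 10m / 12 }  // "and welcome!" vs "arrives...": distance 10, longest length 12
        };

        // Same formula as RateSuggestion: sum of importance * similarity.
        return fields.Sum(f => f.Importance * f.Similarity);
    }

    public static void Main()
    {
        // 0.3 * 0.8 + 0.5 * 1 + 0.2 * (1/6), which is about 0.7733
        Console.WriteLine(Math.Round(Rate(), 4));
    }
}
```

Because the importances add up to 1, the result stays between 0 and 1 and can be compared directly with a threshold such as 0.5.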

Now let's wrap it in a function that takes all possible suggestions, as well as the SuggestionField selectors, in a really cool and easy-to-use fashion:

public static IEnumerable<T> Suggest<T>
    (IEnumerable<T> possibleSuggestions,
     params Func<T, SuggestionField>[] fieldSelectors)
{
    return possibleSuggestions
        .Select(x => new
                     {
                         Suggestion = x,
                         Similarity = RateSuggestion(fieldSelectors.Select(f => f(x)))
                     })
        .OrderByDescending(x => x.Similarity)
        .TakeWhile(x => x.Similarity >= 0.5m) // <-- Threshold here ("at least 50%", so inclusive)!
        .Select(x => x.Suggestion);
}

Okay, okay, that piece of code can be a little confusing at first glance, but relax. The main confusion probably comes from params Func<T, SuggestionField>[] fieldSelectors and therefore from the Similarity = RateSuggestion(fieldSelectors.Select(f => f(x))) as well.

For those of you who are comfortable with LINQ and all those games with selectors, it might already be clear how to use that function. In any case, it will be clear in just a moment!

Usage:

// I'll be using anonymous types here, but you don't have to be lazy about it
var src = new {A = "Hello", B = "World", C = "and welcome!"};
var possibleSuggestions =
    new[]
    {
        new {A = "Hola", B = "World", C = "Welcome!"},
        new {A = "Bye", B = "world", C = "and fairwell"},
        new {A = "Hell", B = "World", C = "arrives..."},
        new {A = "Hello", B = "Earth", C = "and welcome!"},
        new {A = "Hi", B = "world", C = "welcome!"}
    };

var suggestions =
    Suggest(possibleSuggestions,
            x => new SuggestionField
                 {
                     SourceData = src.A,
                     SuggestedData = x.A,
                     Importance = 0.3m // 30%
                 },
            x => new SuggestionField
                 {
                     SourceData = src.B,
                     SuggestedData = x.B,
                     Importance = 0.5m // 50%
                 },
            x => new SuggestionField
                 {
                     SourceData = src.C,
                     SuggestedData = x.C,
                     Importance = 0.2m // 20%
                 }).ToArray();

This might look good to you as it is, or it could be altered to have a usage more to your liking, but I hope the idea is clear and someone will find it useful ;)

P.S.

Of course, the similarity threshold can be passed in as a parameter. Feel free to add any idea and comment on how to make this better or more readable as well!
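
As a rough sketch of that idea, the variant below takes the threshold as an argument. The class name Suggester, the parameter name minSimilarity, the inclusive comparison and the compact Levenshtein helper are my assumptions rather than part of the code above. It also filters with Where before sorting, which gives the same result as sorting and then using TakeWhile:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class SuggestionField
{
    public string SourceData { get; set; }
    public string SuggestedData { get; set; }
    public decimal Importance { get; set; }
}

public static class Suggester
{
    // Hypothetical variant of Suggest: minSimilarity is inclusive,
    // so 0.5m means "at least 50% weighted similarity".
    public static IEnumerable<T> Suggest<T>(
        IEnumerable<T> possibleSuggestions,
        decimal minSimilarity,
        params Func<T, SuggestionField>[] fieldSelectors)
    {
        return possibleSuggestions
            .Select(x => new
            {
                Suggestion = x,
                Similarity = RateSuggestion(fieldSelectors.Select(f => f(x)))
            })
            .Where(x => x.Similarity >= minSimilarity) // drop weak matches before sorting
            .OrderByDescending(x => x.Similarity)
            .Select(x => x.Suggestion);
    }

    private static decimal RateSuggestion(IEnumerable<SuggestionField> fields)
    {
        return fields.Sum(x =>
            x.Importance * CalcLevenshteinSimilarity(x.SourceData, x.SuggestedData));
    }

    private static decimal CalcLevenshteinSimilarity(string a, string b)
    {
        int maxLength = Math.Max(a.Length, b.Length);
        if (maxLength == 0) return 1m; // two empty strings are identical
        return 1 - ((decimal)CalcLevenshteinDistance(a, b) / maxLength);
    }

    // Plain dynamic-programming Levenshtein distance (case-sensitive).
    private static int CalcLevenshteinDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(d[i - 1, j - 1] + cost,
                          Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1));
            }
        return d[a.Length, b.Length];
    }

    public static void Main()
    {
        var src = new { A = "Hello", B = "World", C = "and welcome!" };
        var candidates = new[]
        {
            new { A = "Hola", B = "World", C = "Welcome!" },
            new { A = "Bye", B = "world", C = "and fairwell" },
            new { A = "Hell", B = "World", C = "arrives..." },
            new { A = "Hello", B = "Earth", C = "and welcome!" },
            new { A = "Hi", B = "world", C = "welcome!" }
        };

        var suggestions = Suggest(candidates, 0.5m,
            x => new SuggestionField { SourceData = src.A, SuggestedData = x.A, Importance = 0.3m },
            x => new SuggestionField { SourceData = src.B, SuggestedData = x.B, Importance = 0.5m },
            x => new SuggestionField { SourceData = src.C, SuggestedData = x.C, Importance = 0.2m });

        foreach (var s in suggestions)
            Console.WriteLine(s.A + ";" + s.B + ";" + s.C);
    }
}
```

With the question's data and a threshold of 0.5m, this should list the four suggestions in the same order as the example output (weighted scores of roughly 0.77, 0.74, 0.60 and 0.59), while Bye;world;and fairwell scores about 0.47 and is dropped.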
