3

I was working on matching two sets of company names. I tried to code it in Python with Levenshtein distance, but had issues with short company names and trailing parts like "Pvt" and "Ltd". I ran the same sets through Excel's Fuzzy Lookup and got good results. Is there a way I can see how Excel Fuzzy Lookup is coded and use the same implementation in Python?

shashank
  • 2
    No, Excel is not open-source. Microsoft wants people to pay for their products, so why would they make it possible to get their source code? – ForceBru Sep 28 '18 at 10:54
  • 2
    No, since Excel is closed source, but you could research and see if the Microsoft documentation details the algorithm used. – John Coleman Sep 28 '18 at 10:55

2 Answers

3

The following is an excerpt from Microsoft Fuzzy Lookup Add-In for Excel, Readme.docx. I hope that helps.

Advanced Concepts

Fuzzy Lookup technology is based upon a very simple, yet flexible measure of similarity between two records.

Jaccard similarity

Fuzzy Lookup uses Jaccard similarity, which is defined as the size of the set intersection divided by the size of the set union for two sets of objects. For example, the sets {a, b, c} and {a, c, d} have a Jaccard similarity of 2/4 = 0.5 because the intersection is {a, c} and the union is {a, b, c, d}. The more that the two sets have in common, the closer the Jaccard similarity will be to 1.0.
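In Python, the plain Jaccard similarity described above can be sketched in a few lines. This is just the definition from the readme, not Microsoft's implementation:

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# The example from the excerpt: {a, b, c} vs {a, c, d} -> 2/4 = 0.5
print(jaccard({"a", "b", "c"}, {"a", "c", "d"}))  # 0.5
```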

Weighted Jaccard similarity and tokenization of records

With Fuzzy Lookup, you can assign weights to each item in a set and define the weighted Jaccard similarity as the total weight of the intersection divided by the total weight of the union. For the weighted sets {(a, 2), (b, 5), (c, 3)} and {(a, 2), (c, 3), (d, 7)}, the weighted Jaccard similarity is (2 + 3)/(2 + 3 + 5 + 7) = 5/17 = 0.294.
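The weighted variant can be sketched with dicts mapping token to weight. Again this is only the formula from the excerpt, not Fuzzy Lookup's internals; when a shared token has different weights in the two sets (a case the readme doesn't specify), this sketch takes the larger one for the union:

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard: weight of intersection / weight of union.

    a and b are dicts mapping token -> weight. Assumes shared tokens
    normally carry the same weight in both sets, as in the example.
    """
    inter = sum(a[t] for t in a.keys() & b.keys())
    union = sum(max(a.get(t, 0), b.get(t, 0)) for t in a.keys() | b.keys())
    return inter / union

# The example from the excerpt: (2 + 3) / (2 + 5 + 3 + 7) = 5/17
print(weighted_jaccard({"a": 2, "b": 5, "c": 3},
                       {"a": 2, "c": 3, "d": 7}))  # 0.294...
```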

Because Jaccard similarity is defined over sets, Fuzzy Lookup must first convert data records to sets before it calculates the Jaccard similarity. Fuzzy Lookup converts the data to sets using a Tokenizer. For example, the record {“Jesper Aaberg”, “4567 Main Street”} might be tokenized into the set {“Jesper”, “Aaberg”, “4567”, “Main”, “Street”}. The default tokenizer is for English text, but one may change the LocaleId property in Configure=>Global Settings to specify tokenizers for other languages.
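A minimal word tokenizer along these lines might look as follows. Fuzzy Lookup's built-in tokenizers are locale-aware and more sophisticated; this is only a rough sketch of the idea:

```python
import re

def tokenize(record):
    """Convert a record (an iterable of text fields) into a token set."""
    tokens = set()
    for field in record:
        # Split each field on non-word characters; naive but illustrative.
        tokens.update(re.findall(r"\w+", field))
    return tokens

# The record from the excerpt becomes a set of five tokens.
print(tokenize(["Jesper Aaberg", "4567 Main Street"]))
```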

Token weighting

Because not all tokens are of equal importance, Fuzzy Lookup assigns weights to tokens. Tokens are assigned high weights if they occur infrequently in a sample of records and low weights if they occur frequently. For example, frequent words such as “Corporation” might be given lower weight, while less frequent words such as “Abracadabra” might be given a higher weight. You can override the default token weights by supplying your own table of token weights.
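The readme does not give the exact weighting formula, but an IDF-style weight (log of inverse document frequency) is a common way to realize "rare tokens weigh more". A sketch under that assumption:

```python
import math
from collections import Counter

def token_weights(tokenized_records):
    """Assign each token a weight that grows as the token gets rarer.

    Uses an IDF-style formula as an assumption; the actual scheme
    Fuzzy Lookup uses is not documented in the readme excerpt.
    """
    n = len(tokenized_records)
    df = Counter(t for rec in tokenized_records for t in set(rec))
    return {t: math.log(n / df[t]) for t in df}

records = [{"Contoso", "Corporation"},
           {"Fabrikam", "Corporation"},
           {"Abracadabra", "Ltd"}]
w = token_weights(records)
# "Corporation" appears in 2 of 3 records, so it gets a lower weight
# than "Abracadabra", which appears in only 1.
print(w["Corporation"] < w["Abracadabra"])  # True
```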

Transformations

Transformations greatly increase the power of Jaccard similarity by allowing tokens to be converted from one string to another. For instance, one might know that the name “Bob” can be converted to “Robert”; that “USA” is the same as “United States”; or that “Missispi” is a misspelling of “Mississippi”. There are many classes of such transformations that Fuzzy Lookup handles automatically, such as spelling mistakes (using Edit Transformations described below), string prefixes, and string merge/split operations. You can also specify a table containing your own custom transformations.

Jaccard similarity under transformations

The Jaccard similarity under transformations is the maximum Jaccard similarity between any two transformations of each set. Given a set of transformation rules, all possible transformations of the set are considered. For example, for the sets {a, b, c} and {a, c, d} and the transformation rules {b=>d, d=>e}, the Jaccard similarity is computed as follows:

Variations of {a, b, c}: {a, b, c}, {a, d, c}
Variations of {a, c, d}: {a, c, d}, {a, c, e}

Maximum Jaccard similarity between all pairs:

J({a, b, c}, {a, c, d}) = 2/4 = 0.5
J({a, b, c}, {a, c, e}) = 2/4 = 0.5
J({a, d, c}, {a, c, d}) = 3/3 = 1.0
J({a, d, c}, {a, c, e}) = 2/4 = 0.5

The maximum is 1.0. Note: weighted Jaccard similarity under transformations is simply the maximum weighted Jaccard similarity across all pairs of transformed sets.
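The worked example above can be brute-forced in Python: generate each set's variants by applying each rule independently to the original tokens (matching the excerpt's worked example, which does not chain rules), then take the maximum pairwise Jaccard similarity. Real engines use far more efficient algorithms; this is only for illustration:

```python
from itertools import product

def jaccard(a, b):
    return len(a & b) / len(a | b)

def variants(s, rules):
    """The set itself plus one variant per rule that applies to it."""
    out = {frozenset(s)}
    for src, dst in rules.items():
        if src in s:
            out.add(frozenset((s - {src}) | {dst}))
    return out

def jaccard_with_transforms(a, b, rules):
    """Maximum Jaccard similarity over all pairs of transformed sets."""
    return max(jaccard(set(x), set(y))
               for x, y in product(variants(a, rules), variants(b, rules)))

rules = {"b": "d", "d": "e"}
print(jaccard_with_transforms({"a", "b", "c"}, {"a", "c", "d"}, rules))  # 1.0
```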

Edit distance

Edit distance is the total number of character insertions, deletions, or substitutions that it takes to convert one string to another. For example, the edit distance between “misissipi” and “mississippi” is 2 because two character insertions are required. One of the transformation providers that’s included with Fuzzy Lookup is the EditTransformationProvider, which generates specific transformations for each input record and creates a transformation from the token to all words in its dictionary that are within a given edit distance. The normalized edit distance is the edit distance divided by the length of the input string. In the previous example, the normalized edit distance is 2/9 = 0.222.
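Edit distance is the classic Levenshtein distance and is easy to implement with dynamic programming; the normalized variant described above just divides by the input length. A sketch, not Fuzzy Lookup's internals:

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard two-row DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(s, t):
    """Edit distance divided by the length of the input string."""
    return edit_distance(s, t) / len(s)

# The example from the excerpt: two insertions, 2/9 normalized.
print(edit_distance("misissipi", "mississippi"))  # 2
print(round(normalized_edit_distance("misissipi", "mississippi"), 3))  # 0.222
```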

Quantamax
1

As far as I know, Fuzzy Lookup uses Jaccard similarity; look into its documentation. From my own experiments, you can also try difflib's SequenceMatcher and play with the similarity threshold - it gives very nice results after some normalization of the text strings.
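For the company-name case in the question, the SequenceMatcher approach might look like this. The suffix list and the normalize step are my own assumptions for this use case, not part of difflib:

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to strip; extend as needed.
SUFFIXES = {"pvt", "ltd", "inc", "llc", "corp", "co"}

def normalize(name):
    """Lowercase, drop punctuation, and strip common legal suffixes."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def similarity(a, b):
    """Ratio in [0, 1] between the two normalized company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Trailing parts like "Pvt. Ltd." no longer hurt the match.
print(similarity("Acme Pvt. Ltd.", "ACME"))  # 1.0
```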

KaBi