
I've got a stand-alone CSV data-loading process that I coded in Java that has to use some fuzzy string matching. It's definitely not ideal, but I don't have much choice. I match using a first and last name, and I cache all the possibilities at the start of a run. After finding the match, I then need that person object in multiple places during the run. I used Guava's Objects.hashCode() to create a hash out of the first name and last name.

The caching mechanism looks like this:

Map<Integer, PersonDO> personCache = Maps.newHashMap();
for (PersonDO p : dao.getPeople()) {
    personCache.put(Objects.hashCode(p.getFirstName(), p.getLastName()), p);
}

Most of the time I get hits on firstname+lastname, but when it misses I fall back to Apache's StringUtils.getLevenshteinDistance() to try to match it. This is how the matching logic flows:

person = personCache.get(Objects.hashCode(firstNameFromCSV, lastNameFromCSV));
if (person == null) { // fall back to fuzzy matching
    person = findClosestMatch(firstNameFromCSV + lastNameFromCSV);
}

This is the findClosestMatch() method:

private PersonDO findClosestMatch(String name) {
    int min = 15; // initial value; anything at distance 15 or more is rejected
    int testVal = 0;
    PersonDO matchedPerson = null;
    for (PersonDO person : personCache.values()) {
        testVal = StringUtils.getLevenshteinDistance(name, person.getFirstName() + person.getLastName());
        if (testVal < min) {
            min = testVal;
            matchedPerson = person;
        }
    }
    if (matchedPerson == null) {
        // unchecked exception, so the method signature needn't declare it
        throw new IllegalStateException("Unable to find person: " + name);
    }
    return matchedPerson;
}

This works fine with simple spelling errors, typos and shortened names (i.e. Mike -> Michael), but when one of the incoming names is completely missing from the cache I end up returning a false-positive match. To help keep that from happening I set the initial min value in findClosestMatch() to 15 (i.e. no more than 15 chars off); it works most of the time, but I've still had a few mismatches occur: Mike Thompson hits on Mike Thomas, etc.

Outside of figuring out a way to get a primary key into the file being loaded, does anyone see a way to improve this process? Any other matching algorithms that could help here?

Durandal
    The answer may be to push back on the data provider. There's only so much you can do with shit data. At some point you can't fix the problem with code, and really you shouldn't, because you shouldn't allow a false match to be possible. – Bohemian Jan 11 '14 at 02:45
  • @Bohemian Ideally we'll get this resolved on that provider's side of the fence eventually; as it is, I'm handling most of this manually. Was hoping I was overlooking some way to simplify this problem; may just be a bad problem. – Durandal Jan 11 '14 at 03:47
  • Side note: do you realise that due to hashcode collisions your map may not contain all your persons? – assylias Jan 11 '14 at 03:52
  • @assylias Isn't a hash collision extremely unlikely in this scenario? This is a single-run process that goes away after each run. I should have also added: the max number of people that get cached in a run is around 1000. – Durandal Jan 11 '14 at 06:15
  • 3
  • I am not sure what kind of typos you are facing, but a distance of 15 is pretty extreme. I would preprocess the words before comparing (lower case, accents / transliteration, ...). Matching the given name and surname separately would also reduce the required minimum checked distance. And finally, it would be good if you can do some additional safety check after you find a match - e.g. check that they have the same date of birth or address, ... – Pavel Horal Jan 13 '14 at 07:29
  • 1
  • @MagicMan, I agree with Bohemian, but if you really have got that bad a data provider, you can preprocess the CSV manually. And there you can make your software assist you. For example, you can introduce a preprocessing phase which finds all the fuzzy matches and reports on them. It should also report first on ambiguous matches (when both `Mike Thompson` and `Mike Thomas` exist and both match a name from the CSV). Then you can manually fix these mistakes and run the second phase, which actually alters the database. – Alexey Andreev Jan 13 '14 at 08:42

4 Answers


As I look at this problem I notice a couple of key facts to base some improvements on:

Facts and observations

  1. Max iterations of 1000.
  2. 15 for Levenshtein distance sounds really high to me.
  3. You know, by observing the data empirically, what your fuzzy matching should look like (there are many cases for fuzzy matching and each depends on why the data is bad).
  4. By building this like an API you can plug in many algorithms, including your own and ones like Soundex, instead of depending on just one.

Requirements

I've interpreted your problem as requiring the following two things:

  1. You have PersonDO objects that you want to look up via a key that is based on the name. It sounds like you want to do this because you need a pre-existing PersonDO of which one exists per unique name, and the same name may come up more than once in your loop/workflow.
  2. You need "fuzzy matching" because the incoming data is not pure. For the purposes of this algorithm we'll assume that if a name "matches", it should always use the same PersonDO (in other words, a person's unique identifier is their name, which is obviously not the case in real life, but seems to work for you here).

Implementation

Next, let's look at making some improvements on your code:

1. Cleanup: unnecessary hashcode manipulation.

You don't need to generate hash codes yourself. This confuses the issue a bit.

You're simply generating a hashcode for the combination of the firstname + lastname. This is exactly what HashMap would do if you gave it the concatenated string as a key. So, just do that (and add a space, just in case we want to reverse parse out first/last from the key later).

Map<String, PersonDO> personCache = Maps.newHashMap();

public String getPersonKey(String first, String last) {
  return first + " " + last;
}

...
// Initialization code
for (PersonDO p : dao.getPeople()) {
    personCache.put(getPersonKey(p.getFirstName(), p.getLastName()), p);
}

2. Cleanup: Build a retrieval function to perform the lookup.

Since we've changed the key in the map we need to change the lookup function. We'll build this like a mini-API. If we always knew the key exactly (i.e. unique IDs) we would of course just use Map.get. So we'll start with that, but since we know we will need to add fuzzy matching we'll add a wrapper where this can happen:

public PersonDO findPersonDO(String searchFirst, String searchLast) {
  return personCache.get(getPersonKey(searchFirst, searchLast));
}

3. Build a fuzzy matching algorithm yourself using scoring.

Note that since you are using Guava, I've used a few conveniences here (Ordering, ImmutableList, Doubles, etc.).

First, we want to preserve the work we do to figure out how close a match is. Do this with a POJO:

class Match {
  private final PersonDO candidate;
  private final double score; // 0 - definitely not, 1.0 - perfect match

  public Match(PersonDO candidate, double score) {
    this.candidate = candidate;
    this.score = score;
  }

  public PersonDO getCandidate() { return candidate; }
  public double getScore() { return score; }

  public static final Ordering<Match> SCORE_ORDER =
      new Ordering<Match>() {
    @Override
    public int compare(Match left, Match right) {
      return Doubles.compare(left.score, right.score);
    }
  };
}

Next, we create a method for scoring a generic name. We should score first and last names separately, because it reduces noise. For instance, we don't care if the first name matches any part of the last name, unless your first name might accidentally be in the last-name field or vice versa, which you should account for intentionally and not accidentally (we'll address this later).

Note that we no longer need a "max Levenshtein distance". This is because we normalize scores to the name's length, and we will pick the closest match later. 15 character adds/edits/deletes seems very high, and since we've minimized the blank first/last-name problem by scoring names separately, we could probably now pick a max of 3-4 if you wanted (scoring anything beyond that as 0).

// Typos on the first letter are much more rare.  Max score 0.3
public static final double MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH = 0.3;

public double scoreName(String searchName, String candidateName) {
  if (searchName.equals(candidateName)) return 1.0;

  int editDistance = StringUtils.getLevenshteinDistance(
      searchName, candidateName);

  // Normalize for length (the cast matters: integer division would
  // truncate the score to 0):
  double score =
      (candidateName.length() - editDistance) / (double) candidateName.length();

  // Artificially reduce the score if the first letters don't match
  if (searchName.charAt(0) != candidateName.charAt(0)) {
    score = Math.min(score, MAX_SCORE_FOR_NO_FIRST_LETTER_MATCH);
  }

  // Try Soundex or other matching here.  Remember that you don't want
  // to go above 1.0, so you may want to create a second score and
  // return the higher.

  return Math.max(0.0, Math.min(score, 1.0));
}
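As one hypothetical way to fill in that Soundex comment, Apache Commons Codec's Soundex class could supply the second score; its difference() method returns 0-4, where 4 means the two encodings fully agree (the phoneticScore name and the mapping onto our 0.0-1.0 scale are my own sketch):

public double phoneticScore(String searchName, String candidateName) {
  try {
    // Map Soundex's 0-4 difference onto our 0.0-1.0 scale.
    return new Soundex().difference(searchName, candidateName) / 4.0;
  } catch (EncoderException e) {
    return 0.0; // non-encodable input; rely on edit distance alone
  }
}

Inside scoreName you would then take the higher of the two: score = Math.max(score, phoneticScore(searchName, candidateName));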

As noted above, you could plug in third-party or other word-matching algorithms and gain from the shared knowledge of all of them.

Now, we go through the whole list and score every name. Note that I've added a spot for "tweaks". Tweaks may include (a rough sketch of two of them follows this list):

  • Reversal: If the PersonDO is "Benjamin Franklin", but the CSV sheet may contain "Franklin, Benjamin", then you will want to correct for reversed names. In this case you will probably want to add a method checkForReversal that would score the name in reverse and take that score if it is significantly higher. If it matched in reverse exactly, you would give it a 1.0 score.
  • Abbreviations: You may want to give the score a bonus bump if either first/last names matches identically and the other is fully contained in the candidate (or vice versa). This might indicate an abbreviation, like "Samantha/Sam".
  • Common nicknames: You could add a set of known nicknames ("Robert -> Bob, Rob, Bobby, Robby") and then score the search name against all of them and take the highest score. If it matches any of these, you would probably give it a 1.0 score.
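For illustration, the reversal and nickname tweaks might be sketched like this, sitting alongside scoreName(). The NICKNAMES contents, the 0.2 reversal margin, and the method names are invented placeholders; the multimap is Guava's ImmutableMultimap:

// Hypothetical nickname data; populate from your own knowledge of the data.
private static final ImmutableMultimap<String, String> NICKNAMES =
    ImmutableMultimap.of(
        "robert", "bob",
        "robert", "rob",
        "michael", "mike",
        "samantha", "sam");

// Reversal tweak: also score the name swapped, and prefer that score only
// when it is clearly better than the straight one.
public double scoreWithReversal(String searchFirst, String searchLast,
                                PersonDO person) {
  double straight = (scoreName(searchFirst, person.getFirstName())
      + scoreName(searchLast, person.getLastName())) / 2.0;
  double reversed = (scoreName(searchFirst, person.getLastName())
      + scoreName(searchLast, person.getFirstName())) / 2.0;
  return reversed > straight + 0.2 ? reversed : straight;
}

// Nickname tweak: treat a known nickname pair as an exact match.
public double scoreFirstName(String searchFirst, String candidateFirst) {
  String s = searchFirst.toLowerCase();
  String c = candidateFirst.toLowerCase();
  if (NICKNAMES.get(c).contains(s) || NICKNAMES.get(s).contains(c)) {
    return 1.0;
  }
  return scoreName(searchFirst, candidateFirst);
}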

As you can see, building this as a series of APIs gives us logical locations to easily tweak this to our heart's content.

On with the algorithm:

public static final double MIN_SCORE = 0.3;

public List<Match> findMatches(String searchFirst, String searchLast) {
  List<Match> results = new ArrayList<Match>();

  // Keep in mind that this doesn't scale well.
  // With only 1000 names that's not even a concern a little bit, but
  // thinking ahead, here are two ideas if you need to:
  // - Keep a map of firstnames.  Each entry should be a map of last names.
  //   Then, only iterate through last names if the firstname score is high
  //   enough.
  // - Score each unique first or last name only once and cache the score.
  for(PersonDO person: personCache.values()) {
    // Some of my own ideas follow; you can tweak based on your
    // knowledge of the data.

    // No reason to deal with the combined name, that just makes things
    // more fuzzy (like your problem of too-high scores when one name
    // is completely missing).
    // So, score each name individually.

    double scoreFirst = scoreName(searchFirst, person.getFirstName());
    double scoreLast = scoreName(searchLast, person.getLastName());

    double score = (scoreFirst + scoreLast)/2.0;

    // Add tweaks or alternate scores here.  If you do alternates, in most
    // cases you'll probably want to take the highest, but you may want to
    // average them if it makes more sense.

    if (score > MIN_SCORE) {
      results.add(new Match(person, score));
    }
  }

  return ImmutableList.copyOf(results);
}
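As an aside, here is a rough sketch of the first scaling idea from the comments above (purely illustrative; the byFirstName/byLastName names are invented, and the index would be built once alongside personCache):

// Two-level index: first name -> (last name -> person).  During matching,
// score each distinct first name once and only iterate a bucket's last
// names when the first-name score clears a threshold.
Map<String, Map<String, PersonDO>> byFirstName =
    new HashMap<String, Map<String, PersonDO>>();

for (PersonDO p : personCache.values()) {
  Map<String, PersonDO> byLastName = byFirstName.get(p.getFirstName());
  if (byLastName == null) {
    byLastName = new HashMap<String, PersonDO>();
    byFirstName.put(p.getFirstName(), byLastName);
  }
  byLastName.put(p.getLastName(), p);
}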

Now we modify your findClosestMatch to get just the highest-scoring of all matches (it throws NoSuchElementException if the list is empty).

Possible tweaks:

  • You may want to check whether multiple names scored very close together, and either report the runners-up (see below) or skip the row for manual choice later; a sketch of this check follows the code below.
  • You may want to report how many other matches there were (if you have a very tight scoring algorithm).

Code:

public Match findClosestMatch(String searchFirst, String searchLast) {
  List<Match> matches = findMatches(searchFirst, searchLast);

  // Tweak here

  return Match.SCORE_ORDER.max(matches);
}
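For example, the ambiguity check from the first tweak might be sketched like this (the 0.05 margin and the AmbiguousMatchException type are invented for illustration):

public Match findClosestMatch(String searchFirst, String searchLast) {
  List<Match> matches = findMatches(searchFirst, searchLast);

  // If the two best scores are nearly tied, flag the row for manual
  // review instead of silently guessing.
  List<Match> ranked = Match.SCORE_ORDER.reverse().sortedCopy(matches);
  if (ranked.size() > 1
      && ranked.get(0).getScore() - ranked.get(1).getScore() < 0.05) {
    throw new AmbiguousMatchException(searchFirst + " " + searchLast);
  }

  return Match.SCORE_ORDER.max(matches);
}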

.. and then modify our original getter:

public PersonDO findPersonDO(String searchFirst, String searchLast) {
  PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
  if (person == null) {
    Match match = findClosestMatch(searchFirst, searchLast);
    // Do something here, based on score.
    person = match.getCandidate();
  }
  return person;
}

4. Report "fuzziness" differently.

Finally, you'll notice that findClosestMatch doesn't just return a person, it returns a Match. This is so that we can modify the program to treat fuzzy matches differently from exact matches.

Some things you probably want to do with this:

  • Report guesses: Save all names that matched based on fuzziness into a list so that you can report those and they can be audited later.
  • Validate first: You may want to add a control to turn on and off whether it actually uses the fuzzy matches or just reports them so that you can massage the data before it comes in.
  • Data defensiveness: You may want to qualify any edits made on a fuzzy match as "uncertain". For example, you could disallow any "major edits" to a Person record if the match was fuzzy.
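As a sketch of the "validate first" idea, the getter above might grow a switch like this (the dryRun flag and auditLog are invented names; in dry-run mode the caller must handle the null):

private final boolean dryRun = true; // report fuzzy matches without applying them
private final List<String> auditLog = new ArrayList<String>();

public PersonDO findPersonDO(String searchFirst, String searchLast) {
  PersonDO person = personCache.get(getPersonKey(searchFirst, searchLast));
  if (person == null) {
    Match match = findClosestMatch(searchFirst, searchLast);
    // Record every fuzzy guess so it can be audited later.
    auditLog.add(String.format("Fuzzy match (score %.2f): %s %s -> %s %s",
        match.getScore(), searchFirst, searchLast,
        match.getCandidate().getFirstName(),
        match.getCandidate().getLastName()));
    if (!dryRun) {
      person = match.getCandidate();
    }
  }
  return person;
}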

Conclusion

As you can see, it's not too much code to do this yourself. It's doubtful there will ever be a library that can predict names as well as you can, knowing the data yourself.

Building this in pieces as I've done in the example above will allow you to iterate and tweak easily and even plug in third-party libraries to improve your scoring instead of depending on them entirely -- faults and all.

Nicole
  • I never did go back and remove the hashing; originally I planned on adding several other attributes to the hash, but I found out they weren't consistent between the two sources, so I ended up back at firstname+lastname. Great answer! – Durandal Jan 13 '14 at 18:03
  1. Use your DB to perform the search, using a regular expression in your SELECT or the LIKE operator.

  2. Analyze your database and try to build a Huffman tree or several tables to perform a key-value search.
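For the first suggestion, a hypothetical JDBC prefix lookup might look like this (the person table, its column names, and the in-scope connection and CSV variables are all assumptions):

// Prefix search via LIKE on an assumed "person" table.
PreparedStatement ps = connection.prepareStatement(
    "SELECT id, first_name, last_name FROM person " +
    "WHERE first_name LIKE ? AND last_name LIKE ?");
ps.setString(1, firstNameFromCSV.substring(0, 1) + "%");
ps.setString(2, lastNameFromCSV.substring(0, 1) + "%");
ResultSet rs = ps.executeQuery();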

Giorgio Desideri

There is no single best solution; in any case you have to deal with some sort of heuristic. But you can look for another Levenshtein distance implementation (or implement it yourself), one that gives different costs to different character operations (insertion, deletion, substitution) for different characters. For example, you could give lower costs to pairs of characters that are close on the keyboard. Also, you could dynamically compute the maximum distance threshold based on string length.
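A minimal sketch of such a weighted distance follows; the adjacency table contents and the 0.5 substitution cost are made-up examples:

// Weighted edit distance: substitutions between keys that are adjacent
// on a QWERTY keyboard cost less than other edits.
public class WeightedLevenshtein {
    private static final Map<Character, String> ADJACENT =
        new HashMap<Character, String>();
    static {
        ADJACENT.put('q', "wa");
        ADJACENT.put('w', "qes");
        // ... fill in the rest of the keyboard
    }

    private static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        String near = ADJACENT.get(Character.toLowerCase(a));
        return near != null && near.indexOf(Character.toLowerCase(b)) >= 0
            ? 0.5 : 1.0;
    }

    public static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1.0,   // deletion
                             d[i][j - 1] + 1.0),  // insertion
                    d[i - 1][j - 1]
                        + substitutionCost(s.charAt(i - 1), t.charAt(j - 1)));
            }
        }
        return d[s.length()][t.length()];
    }
}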

And I have a performance tip for you. Each time you compute a Levenshtein distance, n * m operations are performed, where n and m are the lengths of the strings. A Levenshtein automaton, by contrast, is built once and then evaluated very quickly for each string. Be careful: an NFA is very expensive to evaluate, so you first need to convert it to a DFA.

Maybe you should take a look at Lucene; it includes the fuzzy-search capabilities you may need. Or you can even use your DBMS's full-text search, if it's supported. For example, PostgreSQL supports full-text search.
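For instance, a Lucene lookup might boil down to something like this (assuming names were indexed into a "name" field beforehand; the incomingName and indexSearcher variables are placeholders, and exact API details vary by Lucene version):

// Fuzzy (edit-distance based) query against a pre-built name index.
Query query = new FuzzyQuery(new Term("name", incomingName.toLowerCase()));
TopDocs hits = indexSearcher.search(query, 5);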

Alexey Andreev

This is what I did with a similar use case:

  • Match the first name and the last name separately; this gives a more precise match and eliminates some of the false positives:
distance("a b", "a c")                        is   33%
max(distance("a", "a"), distance("b", "c"))   is   100%
  • Base your min distance criterion on the input strings' length, i.e. 0 for strings of up to 2 symbols, 1 for strings of up to 4 symbols:
int length = Math.min(s1.length(), s2.length());

int min;

if(length <= 2) min = 0; else
if(length <= 4) min = 1; else
if(length <= 6) min = 2; else
...

These two should work for your input.
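Combined, a minimal helper might look like this (the method names and the cap of 3 for longer strings are invented for illustration):

// Match two strings with a threshold that grows with their length.
public static boolean fuzzyEquals(String s1, String s2) {
    int length = Math.min(s1.length(), s2.length());
    int min = length <= 2 ? 0 : length <= 4 ? 1 : length <= 6 ? 2 : 3;
    return StringUtils.getLevenshteinDistance(s1, s2) <= min;
}

// Require first and last names to match independently.
public static boolean namesMatch(String first1, String last1,
                                 String first2, String last2) {
    return fuzzyEquals(first1, first2) && fuzzyEquals(last1, last2);
}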

Andrey Chaschev