I need to match strings that are similar but not identical. For example, given the following strings:

  • First Name
  • Last Name
  • LName
  • FName

The output should pair First Name with FName and Last Name with LName, since each pair is a logical match. Are there any libraries that I could use to do this? I am using Java.

Thanks Raam

  • The keyword is fuzzy string matching. Though I am not well versed in common or built-in functionality for this in Java, I did find this: http://stackoverflow.com/questions/327513/fuzzy-string-search-in-java – Ben Jul 14 '14 at 18:59
  • Also known as "edit distance". – David Conrad Jul 14 '14 at 19:06
  • It's worth noting that Levenshtein Distance is not the answer here. If you were looking for pairs with the least Levenshtein Distance, you would match "LName" with "FName", and "First Name" with "Last Name". So whatever method you go for, to get the match you want, it will have to be something _other than_ Levenshtein Distance. – Dawood ibn Kareem Jul 14 '14 at 19:17
  • @DavidWallace My thought exactly. Sounds to me like the good old Soundex function, but it sucks for languages other than English. – Hannes Jul 14 '14 at 19:35
  • Good call, @DavidWallace. I misread the question and assumed that First Name, Last Name, LName, and FName were identifiers, not actual values. – wckd Jul 15 '14 at 11:05

5 Answers

You could use Apache Commons StringUtils...

http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance(java.lang.CharSequence,%20java.lang.CharSequence)

But it's worth noting that this may not be the best algorithm for the specific use-case in the question - I recommend reading some of the other answers here for more ideas.

CupawnTae

Based on the example you gave, you should use a modified Levenshtein distance where the penalty for adding spaces is small and the penalty for mismatched characters is larger. This handles matching abbreviations to the strings they abbreviate quite well. However, that assumes you are mainly aligning abbreviations with the corresponding longer versions of the strings. If you want a more detailed and pointed answer about which methods you can or should use, you should describe more precisely what kinds of matches you want to perform (e.g. more examples, or a high-level description).
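
The weighted idea above could be sketched roughly as follows. This is only an illustration, not a library API: the class name `AbbreviationCost`, the method `cost`, and all four cost constants are assumptions made up for this example. It also assumes the comparison always runs from the abbreviation to the full form, and it makes deletions as expensive as mismatches, which goes slightly beyond what the answer states; without that, "FName" and "LName" could still be bridged cheaply by a delete plus an insert.

```java
/** Hypothetical sketch: asymmetric weighted edit distance from an abbreviation to a full string. */
final class AbbreviationCost {
    // All weights below are illustrative assumptions; tune them for real data.
    static final int INSERT_SPACE = 1;  // adding a space is cheap
    static final int INSERT_CHAR = 2;   // adding any other character is fairly cheap
    static final int DELETE = 20;       // dropping a character of the abbreviation is expensive
    static final int MISMATCH = 20;     // replacing a character with a different one is expensive

    private static int insertCost(char c) {
        return c == ' ' ? INSERT_SPACE : INSERT_CHAR;
    }

    /** Cost of turning {@code abbreviation} into {@code full}; lower means a better match. */
    static int cost(String abbreviation, String full) {
        int n = abbreviation.length();
        int m = full.length();
        int[][] d = new int[n + 1][m + 1];
        // First column: delete every character of the abbreviation.
        for (int i = 1; i <= n; i++) {
            d[i][0] = d[i - 1][0] + DELETE;
        }
        // First row: insert every character of the full string.
        for (int j = 1; j <= m; j++) {
            d[0][j] = d[0][j - 1] + insertCost(full.charAt(j - 1));
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = abbreviation.charAt(i - 1) == full.charAt(j - 1);
                int substitute = d[i - 1][j - 1] + (same ? 0 : MISMATCH);
                int delete = d[i - 1][j] + DELETE;
                int insert = d[i][j - 1] + insertCost(full.charAt(j - 1));
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }
}
```

With these weights, `AbbreviationCost.cost("FName", "First Name")` is 9, while `cost("FName", "Last Name")` is 27 and `cost("FName", "LName")` is 20, so each abbreviation lands on its intended full form. The comparison is case-sensitive and deliberately one-directional.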

user2566092

As @CupawnTae already said, StringUtils works well for this. Below is a simple example I found on Stack Overflow:

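import java.util.Collection;
import org.apache.commons.lang3.StringUtils;

// Returns the element whose toString() has the smallest Levenshtein distance
// to target.toString(), or null if the collection is empty.
// Note: later commons-lang3 releases deprecate getLevenshteinDistance in favor
// of LevenshteinDistance from Apache Commons Text.
public static <T> T getTheClosestMatch(Collection<T> collection, Object target) {
    int distance = Integer.MAX_VALUE;
    T closest = null;
    String targetText = target.toString();
    for (T candidate : collection) {
        int currentDistance = StringUtils.getLevenshteinDistance(candidate.toString(), targetText);
        // Keep the first candidate with the strictly smallest distance seen so far.
        if (currentDistance < distance) {
            distance = currentDistance;
            closest = candidate;
        }
    }
    return closest;
}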
Ashish Shetkar

An answer to a really similar question to yours can be found here.

Also, Wikipedia has an article on Approximate String Matching that can be found here. If the first link isn't what you're looking for, I would suggest reading the Wikipedia article and digging through its sources to find what you need.

Sorry I can't personally be of more help to you, but I really hope that these resources can help you find what you're looking for!

joe cool

Spell-check algorithms use a variant of the Levenshtein distance algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance. I implemented it in class for a project and it was fairly simple to do. If you don't want to implement it yourself, you can use the name to search for libraries that provide it.
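
For reference, a minimal sketch of the standard dynamic-programming formulation of Levenshtein distance is shown here. The class name `LevenshteinSketch` is made up for this example, and it uses unit costs for insertion, deletion, and substitution with two rolling rows instead of a full table.

```java
/** Hypothetical sketch: plain Levenshtein distance with unit costs. */
final class LevenshteinSketch {
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is the prefix length.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Swap rows so the row just computed becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

On the strings from the question, this gives a distance of 1 for "LName" vs "FName" and 3 for "First Name" vs "Last Name", but 5 for "First Name" vs "FName" and 4 for "Last Name" vs "LName". That matches the caveat raised in the comments: the unweighted distance pairs the strings the wrong way round for this use case.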

wckd