I need to do fuzzy comparison of a large number of strings and am looking at Jaro-Winkler, which takes differences in the order of letters into account. Is anyone aware of a way to do this in Objective-C or Swift, either with Jaro-Winkler or with some method native to iOS?

Thanks for any recommendations or suggestions.

Rob
user6631314
    I do not believe that there is any native implementation. You’ll have to implement it yourself or find a third-party library. Unfortunately, the latter is off-topic for Stack Overflow. – Rob Feb 26 '19 at 17:02
    Objective-C is C, and any `NSString` can be converted into a C string, so a C-based implementation could be used to generate the metric. Try GitHub. – James Bucanek Feb 26 '19 at 19:16
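
The bridging mentioned in the comment above can be sketched from Swift as well. This is only an illustration: `strcmp` from the C standard library stands in for a C-based metric, since a real Jaro-Winkler C function would first have to be added to the project and exposed through a bridging header. Note that the C side sees UTF-8 bytes, not characters, which matters for non-ASCII input.

```swift
import Foundation

let a = "frog"
let b = "fog"

// Swift passes a String to a `const char *` parameter as a temporary,
// NUL-terminated UTF-8 buffer, so C functions can be called directly.
// `strcmp` is only a stand-in for a C similarity function here.
let direct = strcmp(a, b)

// The explicit form: the pointers are valid only inside the closures.
let explicit = a.withCString { aPtr in
    b.withCString { bPtr in strcmp(aPtr, bPtr) }
}

// Both calls compare the same bytes, so the signs agree.
print(direct > 0, explicit > 0)
```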

1 Answer

I took inspiration from the Apache Commons implementation and rewrote it in Swift:

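import Foundation  // provides round(_:)

extension String {
    /// Returns a score from 0 (no matching characters) to 1 (identical
    /// non-empty strings), rounded to two decimal places.
    static func jaroWinglerDistance(_ first: String, _ second: String) -> Double {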
        let longer = Array(first.count > second.count ? first : second)
        let shorter = Array(first.count > second.count ? second : first)

        let (numMatches, numTranspositions) = jaroWinklerData(longer: longer, shorter: shorter)

        if numMatches == 0 {
            return 0
        }

        let defaultScalingFactor = 0.1
        let percentageRoundValue = 100.0

        let jaro = [
            numMatches / Double(first.count),
            numMatches / Double(second.count),
            (numMatches - numTranspositions) / numMatches
        ].reduce(0, +) / 3

        let jaroWinkler: Double

        if jaro < 0.7 {
            jaroWinkler = jaro
        } else {
            let commonPrefixLength = Double(commonPrefix(first, second).count)
            jaroWinkler = jaro + Swift.min(defaultScalingFactor, 1 / Double(longer.count)) * commonPrefixLength * (1 - jaro)
        }

        return round(jaroWinkler * percentageRoundValue) / percentageRoundValue
    }

    private static func commonPrefix(_ first: String, _ second: String) -> String {
        return String(
            zip(first, second)
                .prefix { $0.0 == $0.1 }
                .map { $0.0 }
        )
    }

    private static func jaroWinklerData(
        longer: Array<Character>,
        shorter: Array<Character>
    ) -> (numMatches: Double, numTranspositions: Double) {
        let window = Swift.max(longer.count / 2 - 1, 0)

        var shorterMatchedChars: [Character] = []
        var longerMatches = Array<Bool>(repeating: false, count: longer.count)

        for (offset, shorterChar) in shorter.enumerated() {
            let windowRange = Swift.max(offset - window, 0) ..< Swift.min(offset + window + 1, longer.count)
            if let matchOffset = windowRange.first(where: { !longerMatches[$0] && shorterChar == longer[$0] }) {
                shorterMatchedChars.append(shorterChar)
                longerMatches[matchOffset] = true
            }
        }

        let longerMatchedChars = longerMatches
            .enumerated()
            .filter { $0.element }
            .map { longer[$0.offset] }

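        // Matched characters that appear in a different order in the two strings;
        // every swap shows up as two mismatched pairs, hence the division by 2.
        let numTranspositions: Int = zip(shorterMatchedChars, longerMatchedChars)
            .filter { $0.0 != $0.1 }
            .count / 2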

        return (
            numMatches: Double(shorterMatchedChars.count),
            numTranspositions: Double(numTranspositions)
        )
    }
}

Tested with the examples found in the original code:

print(String.jaroWinglerDistance("", ""))
print(String.jaroWinglerDistance("", "a"))
print(String.jaroWinglerDistance("aaapppp", ""))
print(String.jaroWinglerDistance("frog", "fog"))
print(String.jaroWinglerDistance("fly", "ant"))
print(String.jaroWinglerDistance("elephant", "hippo"))
print(String.jaroWinglerDistance("hippo", "elephant"))
print(String.jaroWinglerDistance("hippo", "zzzzzzzz"))
print(String.jaroWinglerDistance("hello", "hallo"))
print(String.jaroWinglerDistance("ABC Corporation", "ABC Corp"))
print(String.jaroWinglerDistance("D N H Enterprises Inc", "D & H Enterprises, Inc."))
print(String.jaroWinglerDistance("My Gym Children's Fitness Center", "My Gym. Childrens Fitness"))
print(String.jaroWinglerDistance("PENNSYLVANIA", "PENNCISYLVNIA"))
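
Since the question is about comparing a large number of strings, here is a rough sketch of how a scoring function could be used to rank candidates against a query. The `bestMatches` helper, its parameters, and the `prefixRatio` scorer are hypothetical names introduced only for this illustration. `prefixRatio` is a deliberately trivial stand-in so that the snippet runs on its own; in practice you would pass `String.jaroWinglerDistance` as the `similarity` argument.

```swift
// Hypothetical helper: scores every candidate against `query`, drops those
// below `threshold`, and returns at most `limit` results, best score first.
func bestMatches(
    for query: String,
    in candidates: [String],
    limit: Int,
    threshold: Double,
    similarity: (String, String) -> Double
) -> [(candidate: String, score: Double)] {
    let scored = candidates
        .map { (candidate: $0, score: similarity(query, $0)) }
        .filter { $0.score >= threshold }
        .sorted { $0.score > $1.score }
    return Array(scored.prefix(limit))
}

// Stand-in scorer (an assumption, not Jaro-Winkler): the share of the longer
// string that is covered by the common prefix of the two strings.
func prefixRatio(_ a: String, _ b: String) -> Double {
    let longest = max(a.count, b.count)
    guard longest > 0 else { return 0 }
    let shared = zip(a, b).prefix { $0.0 == $0.1 }.count
    return Double(shared) / Double(longest)
}

let names = ["ABC Corp", "ABD Corporation", "XYZ Ltd"]
let top = bestMatches(
    for: "ABC Corporation",
    in: names,
    limit: 2,
    threshold: 0.1,
    similarity: prefixRatio
)
for match in top {
    print(match.candidate, match.score)
}
```

Passing the scorer as a closure keeps the ranking logic independent of the metric, so a different similarity function can be swapped in without touching `bestMatches`.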

I have also found another implementation of string similarity functions on GitHub.

Sulthan
  • @Ivan It's Swift 4.2, so the changes needed for Swift 5 would be minimal. However, it uses a `String` extension to access characters by an `Int` index, which would hurt performance. – Sulthan Jun 06 '19 at 13:38
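
The performance remark in the last comment can be illustrated with a small sketch. A Swift `String` is not randomly accessible, so reaching the character at an integer offset means walking from `startIndex` every time, whereas converting to an array once (as the answer's code does) makes each later lookup constant-time. The `text` value is just an example.

```swift
let text = "PENNSYLVANIA"

// Offset-based access walks the string from the start on every call,
// so each lookup costs time proportional to the offset.
let viaIndex = text[text.index(text.startIndex, offsetBy: 5)]

// Converting once costs O(n), after which every subscript is O(1).
let characters = Array(text)
let viaArray = characters[5]

print(viaIndex, viaArray)  // prints "Y Y"
```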