I am trying to find the range of substrings in a Tamil string. The `range(of:)` function seems to have issues specifically with substrings that end with the Unicode modifier glyph ‘Tamil pulli’ (U+0BCD).

The range functions work as expected when the match falls at the end of a word in the larger string. However, they fail to find the match when it occurs in the middle of a word.

Here is a brief example that searches the string s for two substrings, ss1 and ss2. ss2 is found correctly, but ss1 is not. We have replicated this with several words and confirmed the behaviour: the match fails when the substring ends with the modifier glyph (U+0BCD) and appears in the middle of a larger word. The Unicode code points of all three variables are correct, as shown in the output below.

// None of these values is mutated, so `let` avoids a compiler warning.
let s = "அவர்கள்"
let ss1 = "அவர்"
let ss2 = "கள்"
if let r2 = s.range(of: ss2)?.lowerBound {
    print("INFO: Success! Found the substring",s[r2...],"in string",s)
} else {
    print("ERROR: substring",ss2,"not found in string",s)
}

if let r1 = s.range(of: ss1)?.upperBound {
    print(s[..<r1])
} else {
    print("ERROR: substring",ss1,"not found in string",s)
}
print(s.unicodeScalars.map { $0 })
print(ss1.unicodeScalars.map { $0 })
print(ss2.unicodeScalars.map { $0 })

Output:
INFO: Success! Found the substring கள் in string அவர்கள்
ERROR: substring அவர் not found in string அவர்கள்
["\u{0B85}", "\u{0BB5}", "\u{0BB0}", "\u{0BCD}", "\u{0B95}", "\u{0BB3}", "\u{0BCD}"]
["\u{0B85}", "\u{0BB5}", "\u{0BB0}", "\u{0BCD}"]
["\u{0B95}", "\u{0BB3}", "\u{0BCD}"]

Update:
The modifiers in Tamil add or remove the vowel sound from the base characters, and in that sense they are different from diacritics. In certain cases, they even change the form of the base character completely. For example, here is the result of adding some modifiers (with their code points) to the base character ர (U+0BB0).

ர - ra (consonant vowel; no modifier)
ரா - raa (U+0BBE)
ரி - ri (U+0BBF)
ரு - ru (base char is unrecognizable now; U+0BC1)
ரெ - re (U+0BC6)
ரை - rai (U+0BC8)
ரொ - ro (U+0BCA)
ர் - r (pure consonant; U+0BCD)
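The list above can be checked at the code point level. This is only a minimal sketch: `codePoints(of:)` is a hypothetical helper written for this illustration, not a standard library function. It shows that each modified form is the base scalar U+0BB0 followed by a separate modifier scalar, so a comparison over Unicode scalars can tell ர and ர் apart.

```swift
// Hypothetical helper: lists the code points of a string as "U+XXXX" labels.
func codePoints(of text: String) -> [String] {
    return text.unicodeScalars.map { (scalar: Unicode.Scalar) -> String in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Left-pad to at least four hex digits.
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

let ra = "\u{0BB0}"              // ர  (consonant with inherent vowel)
let r = "\u{0BB0}\u{0BCD}"       // ர் (pure consonant, base + pulli)
let ru = "\u{0BB0}\u{0BC1}"      // ரு (base + vowel sign u)

print(codePoints(of: ra))        // ["U+0BB0"]
print(codePoints(of: r))         // ["U+0BB0", "U+0BCD"]
print(codePoints(of: ru))        // ["U+0BB0", "U+0BC1"]

// Comparing scalar by scalar distinguishes the base from the modified form.
print(ra.unicodeScalars.elementsEqual(r.unicodeScalars))   // false
```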

So, while the solution by @Larme below works, it does so by treating the dot glyph in ர் as a diacritic, which leads to invalid/undesired matches for us.

s = "அவர்கள்", ss1 = "அவர்" - Matches அவர்
s = "அவரகள்", ss1 = "அவர்" - Matches அவர (undesired)
s = "அவராகள்", ss1 = "அவர்" - Matches அவரா (undesired)
s = "அவரிகள்", ss1 = "அவர்" - Matches அவர (undesired)
s = "அவரைகள்", ss1 = "அவர்" - Matches அவர (undesired)

baskaran
  • I don't know Tamil, so I can't say why it would make sense (it's not a diacritic I know of), but with `s.range(of: ss1, options: .diacriticInsensitive)` instead of `s.range(of: ss1)` it seems to work. – Larme Aug 19 '23 at 13:57
  • Ok cool. It did work. Thanks a lot. But what I don't understand is, why is this an issue only in the specific case of the substring match in positions other than the end of a word in the larger string. Also, this problem doesn't occur with 10 other modifiers in Tamil as far as we've tested. – baskaran Aug 19 '23 at 15:59
  • @baskaran You should use `localizedStandardRange` – Leo Dabus Aug 19 '23 at 23:06
  • @Larme @LeoDabus While both `, options: .diacriticInsensitive` and `localizedStandardRange` are giving a way to overcome the issue for the specific case, they'll lead to several undesired/ invalid matches for us. Please see my **Update** to the original question above. But, it seems to me like the modifiers in Tamil (and other Indian languages) should **not** be treated like diacritics in Swift. – baskaran Aug 20 '23 at 01:57
  • @baskaran check my post below – Leo Dabus Aug 21 '23 at 01:29

1 Answer

If `localizedStandardRange` doesn't meet your requirements, you will need to implement your own method to find the range of your substrings, perhaps along the lines of what is shown in this post:

extension Collection where Element: Equatable {
    /// Returns the range of the first occurrence of `collection`, or nil if there is none.
    /// On a String the elements are Characters; on `unicodeScalars` they are code points.
    func findRange<C: Collection>(of collection: C) -> Range<Index>? where C.Element == Element {
        guard !collection.isEmpty else { return nil }
        let size = collection.count
        // Only consider start positions that leave room for `size` elements.
        for start in indices.dropLast(size - 1) {
            let range = start..<index(start, offsetBy: size)
            if self[range].elementsEqual(collection) { return range }
        }
        return nil
    }
    /// Returns true if `collection` occurs as a contiguous run of elements.
    func containsSubsequence<C: Collection>(_ collection: C) -> Bool where C.Element == Element {
        return findRange(of: collection) != nil
    }
}
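To make the search exact at the code point level, the same sliding-window idea can be applied to the string's `unicodeScalars` view. The following is a minimal sketch under that assumption: `scalarRange(of:)` is a hypothetical name chosen for this example, and the strings are written with escapes so the code points match those printed in the question.

```swift
extension String {
    /// Hypothetical helper: range of the first exact code point match of `needle`.
    func scalarRange(of needle: String) -> Range<String.Index>? {
        let haystack = unicodeScalars
        let target = needle.unicodeScalars
        guard !target.isEmpty, target.count <= haystack.count else { return nil }
        // Last start position that still leaves room for the whole target.
        let lastStart = haystack.index(haystack.endIndex, offsetBy: -target.count)
        var start = haystack.startIndex
        while start <= lastStart {
            let end = haystack.index(start, offsetBy: target.count)
            if haystack[start..<end].elementsEqual(target) {
                return start..<end
            }
            haystack.formIndex(after: &start)
        }
        return nil
    }
}

let s = "\u{0B85}\u{0BB5}\u{0BB0}\u{0BCD}\u{0B95}\u{0BB3}\u{0BCD}"      // அவர்கள்
let ss1 = "\u{0B85}\u{0BB5}\u{0BB0}\u{0BCD}"                            // அவர்
let noPulli = "\u{0B85}\u{0BB5}\u{0BB0}\u{0B95}\u{0BB3}\u{0BCD}"        // அவரகள்
let withAa = "\u{0B85}\u{0BB5}\u{0BB0}\u{0BBE}\u{0B95}\u{0BB3}\u{0BCD}" // அவராகள்

if let r = s.scalarRange(of: ss1) {
    print(s[r])                               // prints அவர்
}
print(noPulli.scalarRange(of: ss1) == nil)    // true: no match without the pulli
print(withAa.scalarRange(of: ss1) == nil)     // true: U+0BBE is not U+0BCD
```

One caveat of matching on scalars: a needle such as அவர (without the pulli) would still match the first three scalars of அவர், so a caller who needs whole-character matches may want to check in addition that the returned bounds fall on Character boundaries.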
Leo Dabus