I am trying to find the range of a substring in a Tamil string. The range(of:) function seems to have trouble specifically with substrings that end with the combining sign 'Tamil pulli' (U+0BCD, TAMIL SIGN VIRAMA).
range(of:) works as expected when the substring match falls at the end of a word in the larger string. However, it fails to find the match when the substring ends in the middle of a word in the larger string.
Here is a brief example with a string s and two substrings, ss1 and ss2. While ss2 is found correctly in the larger string, ss1 is not. We have replicated this with several words and confirmed the behaviour: the match is not found when the substring ends with the pulli (U+0BCD) and that ending falls in the middle of a larger word in the string. The Unicode code points of all three variables are correct, as shown in the output below.
import Foundation

let s = "அவர்கள்"
let ss1 = "அவர்"
let ss2 = "கள்"

// ss2 sits at the end of the word in s
if let r2 = s.range(of: ss2)?.lowerBound {
    print("INFO: Success! Found the substring", s[r2...], "in string", s)
} else {
    print("ERROR: substring", ss2, "not found in string", s)
}

// ss1 ends with U+0BCD in the middle of the word in s
if let r1 = s.range(of: ss1)?.upperBound {
    print(s[..<r1])
} else {
    print("ERROR: substring", ss1, "not found in string", s)
}

// Dump the code points of all three strings
print(s.unicodeScalars.map { $0 })
print(ss1.unicodeScalars.map { $0 })
print(ss2.unicodeScalars.map { $0 })
Output:
INFO: Success! Found the substring கள் in string அவர்கள்
ERROR: substring அவர் not found in string அவர்கள்
["\u{0B85}", "\u{0BB5}", "\u{0BB0}", "\u{0BCD}", "\u{0B95}", "\u{0BB3}", "\u{0BCD}"]
["\u{0B85}", "\u{0BB5}", "\u{0BB0}", "\u{0BCD}"]
["\u{0B95}", "\u{0BB3}", "\u{0BCD}"]
Update:
The modifiers in Tamil add or remove the vowel sound of the base character, and in that sense they are different from diacritics. In certain cases, they even change the form of the base character completely. For example, here is the result of adding some modifiers (with their code points) to the base character ர (U+0BB0).
ர - ra (consonant vowel; no modifier)
ரா - raa (U+0BBE)
ரி - ri (U+0BBF)
ரு - ru (base char is unrecognizable now; U+0BC1)
ரெ - re (U+0BC6)
ரை - rai (U+0BC8)
ரொ - ro (U+0BCA)
ர் - r (pure consonant; U+0BCD)
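To make the list above concrete, here is a minimal sketch that builds the same forms from their code points. The helper name compose is made up for this illustration and is not a Swift or Foundation API. It only shows that every base-plus-sign pair is a distinct string under Swift's normal string comparison, which is why we need them to stay distinct when searching.

```swift
// Base consonant ra (U+0BB0) and the signs from the list above.
let base: Unicode.Scalar = "\u{0BB0}"
let signs: [Unicode.Scalar] = [
    "\u{0BBE}", "\u{0BBF}", "\u{0BC1}", "\u{0BC6}",
    "\u{0BC8}", "\u{0BCA}", "\u{0BCD}",
]

// Hypothetical helper: joins a base scalar and one sign into a String.
func compose(_ base: Unicode.Scalar, _ sign: Unicode.Scalar) -> String {
    var scalars = String.UnicodeScalarView()
    scalars.append(base)
    scalars.append(sign)
    return String(scalars)
}

let forms = signs.map { compose(base, $0) }
print(forms)

// None of the forms equals the bare base character,
// and no two forms equal each other.
print(forms.allSatisfy { $0 != String(base) })  // true
print(Set(forms).count == forms.count)          // true
```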
So, while the solution by @Larme below does find the match, it does so by treating the dot glyph (pulli) in ர் as a diacritic, and that leads to invalid, undesired matches for us:
s = "அவர்கள்", ss1 = "அவர்" - Matches அவர்
s = "அவரகள்", ss1 = "அவர்" - Matches அவர (undesired)
s = "அவராகள்", ss1 = "அவர்" - Matches அவரா (undesired)
s = "அவரிகள்", ss1 = "அவர்" - Matches அவர (undesired)
s = "அவரைகள்", ss1 = "அவர்" - Matches அவர (undesired)
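For reference, the behaviour we are after can be sketched as a plain code-point comparison over the unicodeScalars view. This is only a rough illustration, not a proposed fix for range(of:). The function scalarRange(of:in:) is a made-up helper, and it assumes both strings use the same normalization form, since it compares scalars literally (for example, U+0BCA would not match its decomposed form U+0BC6 U+0BBE).

```swift
// Hypothetical helper (not a Swift or Foundation API): finds `needle` in
// `haystack` by comparing Unicode scalars one by one. It ignores grapheme
// cluster boundaries and does no normalization or diacritic folding.
func scalarRange(of needle: String, in haystack: String) -> Range<String.Index>? {
    let h = Array(haystack.unicodeScalars)
    let n = Array(needle.unicodeScalars)
    guard !n.isEmpty, n.count <= h.count else { return nil }

    let view = haystack.unicodeScalars
    for start in 0...(h.count - n.count) {
        if h[start..<(start + n.count)].elementsEqual(n) {
            // Convert scalar offsets back into indices of the scalar view.
            let lower = view.index(view.startIndex, offsetBy: start)
            let upper = view.index(lower, offsetBy: n.count)
            return lower..<upper
        }
    }
    return nil
}

// The returned range is used on the unicodeScalars view, not on the
// String itself, so no Character boundary is assumed.
for candidate in ["அவர்கள்", "அவரகள்", "அவராகள்"] {
    if let r = scalarRange(of: "அவர்", in: candidate) {
        print("match:", String(candidate.unicodeScalars[r]), "in", candidate)
    } else {
        print("no match in", candidate)
    }
}
```

With this kind of comparison only the first candidate should match, because the other two lack U+0BCD after ர.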