My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) the specified UTF-16 code unit is part of.
Example:
(I put the code in a Gist for easy copying and pasting.)
This is my test string:
let str = "👨🏾‍🚒"
(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)
It's a single Character (grapheme cluster) that consists of four Unicode scalars or seven UTF-16 code units:
print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7
Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:
let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)
I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster:
let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]
Or the subscript operation might even trap (I'm not sure this is intended behavior):
let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster
You can test if an index falls on a Character boundary:
extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}
trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)
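To make the inconsistency visible across the whole string, here is a small loop that probes every UTF-16 offset (a sketch; I'm not claiming specific results for each offset, since they may vary by Swift version):

```swift
// Probe every UTF-16 offset of the test string and report whether
// String.Index.init(_:within:) accepts it as a Character boundary.
let probe = "👨🏾‍🚒"
for offset in 0..<probe.utf16.count {
    let index = String.Index(encodedOffset: offset)
    let isBoundary = String.Index(index, within: probe) != nil
    print("offset \(offset): \(isBoundary)")
}
```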
The Issue:
I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:
If the index passed as sourcePosition represents the start of an extended grapheme cluster—the element type of a string—then the initializer succeeds.
Here, utf16Index doesn't represent the start of an extended grapheme cluster — the grapheme cluster starts at offset 0, not offset 2. Yet the initializer succeeds.
As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.
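For comparison, Foundation's NSString API does locate the cluster start here (a sketch; I'm assuming rangeOfComposedCharacterSequence(at:) treats the ZWJ sequence as a single composed character sequence on this OS):

```swift
import Foundation

// Ask NSString for the UTF-16 range of the composed character
// sequence containing the offset, then convert its start back
// to a String.Index.
let s = "👨🏾‍🚒"
let range = (s as NSString).rangeOfComposedCharacterSequence(at: 2)
let clusterStart = String.Index(encodedOffset: range.location)
print(range.location)  // 0 if the ZWJ sequence is treated as one cluster
print(s[clusterStart])
```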
Am I overlooking something? Is there another way to test if an index falls on the start of a Character? Is this a bug in Swift?
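For what it's worth, one workaround is to walk the string's Character boundaries forward from startIndex and stop at the cluster that spans the target offset (a sketch assuming Swift 4's encodedOffset API; the function name characterIndex(forUTF16Offset:in:) is mine):

```swift
// Sketch: find the index of the Character (grapheme cluster) that
// contains a given UTF-16 offset by walking character boundaries
// from the start of the string.
func characterIndex(forUTF16Offset offset: Int, in str: String) -> String.Index? {
    guard offset >= 0 && offset < str.utf16.count else { return nil }
    var index = str.startIndex
    while index < str.endIndex {
        let next = str.index(after: index)
        // next.encodedOffset is the UTF-16 offset where the next cluster starts.
        if offset < next.encodedOffset { return index }
        index = next
    }
    return nil
}

let sample = "a👨🏾‍🚒b"
if let i = characterIndex(forUTF16Offset: 3, in: sample) {
    print(sample[i]) // the emoji cluster spanning UTF-16 offsets 1...7
}
```

This is O(n) in the length of the string, but it only relies on Character iteration, not on the boundary test that misbehaves above.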
My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.
Update: Check out the interesting Twitter thread about this question.
Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.