From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary

Question

My goal: given an arbitrary UTF-16 position in a String, find the corresponding String.Index that represents the Character (i.e. the extended grapheme cluster) the specified UTF-16 code unit is a part of.

Example:

(I put the code in a Gist for easy copying and pasting.)

This is my test string:

let str = "👨🏾‍🚒"

(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)

It's a single Character (grapheme cluster) that consists of four Unicode scalars, or seven UTF-16 code units:

print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7

Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:

let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)

I can subscript the string with this index, but if the index doesn't fall on a Character boundary, the Character returned by the subscript might not cover the entire grapheme cluster:

let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]

Or the subscript operation might even trap (I'm not sure this is intended behavior):

let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster

You can test if an index falls on a Character boundary:

extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}

trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)

The Issue:

I think the problem is that this last expression returns true. The documentation for String.Index.init(_:within:) says:

If the index passed as sourcePosition represents the start of an extended grapheme cluster—the element type of a string—then the initializer succeeds.

Here, utf16Index doesn't represent the start of an extended grapheme cluster — the grapheme cluster starts at offset 0, not offset 2. Yet the initializer succeeds.

As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.

Am I overlooking something? Is there another way to test if an index falls on the start of a Character? Is this a bug in Swift?

My environment: Swift 4.0/Xcode 9.0 on macOS 10.13.

Update: Check out the interesting Twitter thread about this question.

Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.

It seems that `String.Index(_:within:)` does not treat Emoji sequences as a single grapheme cluster (even if Swift 4 is based on Unicode 9). — Martin R, Sep 25 '17 at 16:47

Martin R · Accepted Answer · 2017-09-25T18:44:49.893

A possible solution, using the rangeOfComposedCharacterSequence(at:) method:

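import Foundation // rangeOfComposedCharacterSequence(at:) comes from Foundation

extension String {
    func index(utf16Offset: Int) -> String.Index? {
        // Reject offsets outside the UTF-16 view.
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        // Expand to the full grapheme cluster and return its start.
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}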

Example:

let str = "a‍bcd‍‍‍e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}

Output:

0 a
1 ‍
2 ‍
3 ‍
4 ‍
5 ‍
6 ‍
7 ‍
8 b
9 
10 
11 
12 
13 c
14 
15 
16 d
17 ‍‍‍
18 ‍‍‍
19 ‍‍‍
20 ‍‍‍
21 ‍‍‍
22 ‍‍‍
23 ‍‍‍
24 ‍‍‍
25 ‍‍‍
26 ‍‍‍
27 ‍‍‍
28 e
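
If pulling in Foundation is not an option, the same lookup can probably be done with the standard library alone by walking the Character indices and counting UTF-16 code units. This is only a sketch: the helper name characterIndex(containingUTF16Offset:) and the sample string are made up for illustration, and the walk is linear in the length of the string, so it is slower than the version above for long strings.

```swift
extension String {
    /// Hypothetical helper (not a standard library API): returns the index of
    /// the Character that contains the given UTF-16 offset, or nil if the
    /// offset is out of bounds.
    func characterIndex(containingUTF16Offset offset: Int) -> String.Index? {
        guard offset >= 0 else { return nil }
        var end = 0 // number of UTF-16 code units covered so far
        for i in indices {
            // `indices` only yields Character boundaries.
            end += String(self[i]).utf16.count
            if offset < end { return i }
        }
        return nil // offset >= utf16.count
    }
}

// Sample built from escapes: "a", the 7-code-unit emoji from the question, "b".
let sample = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
for offset in 0..<sample.utf16.count {
    if let idx = sample.characterIndex(containingUTF16Offset: offset) {
        print(offset, sample[idx])
    }
}
```

Because it never constructs an index from a raw offset, this variant does not depend on how String.Index(_:within:) treats positions inside an emoji sequence.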

Thanks! This is a very nice solution that I did not think of. It even seems to work with indices from the UTF-8 view, so it's not limited to UTF-16 offsets. — Ole Begemann, Sep 26 '17 at 09:57

score 1 · Answer 2 · answered Jul 03 '18 at 07:41

This has been fixed in Swift 4.1.

Ole Begemann
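
With the fix in place, the approach described in the question (step the offset back until String.Index(_:within:) succeeds) should work. The following is a rough sketch rather than a tested recipe: the helper name characterStart(forUTF16Offset:) is invented here, and it assumes Swift 5 or later, where String.Index(utf16Offset:in:) replaces the deprecated String.Index(encodedOffset:).

```swift
extension String {
    /// Hypothetical helper: steps back from `offset` until it reaches an index
    /// that String.Index(_:within:) accepts as a Character boundary.
    func characterStart(forUTF16Offset offset: Int) -> String.Index? {
        guard offset >= 0 && offset < utf16.count else { return nil }
        var current = offset
        while current > 0 {
            let candidate = String.Index(utf16Offset: current, in: self)
            // Expected to be nil unless `candidate` starts a grapheme cluster
            // (the behavior the question expected, per the fix noted above).
            if let aligned = String.Index(candidate, within: self) {
                return aligned
            }
            current -= 1
        }
        return startIndex // offset 0 is always a Character boundary
    }
}

// Sample built from escapes: "a", the 7-code-unit emoji from the question, "b".
let text = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
if let start = text.characterStart(forUTF16Offset: 4) {
    print(text[start])
}
```

The loop takes at most as many steps as the containing Character has UTF-16 code units, so it stays cheap even for long strings.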
