I'm trying to tokenize a string into words in a Cocoa app, but ran into a problem with NLTokenizer.

When the input string starts with a symbol from the Unicode category "Other Symbol" or from the "Specials" block, such as NSTextAttachment.character, tokenization fails (i.e. it returns an empty list).

The problem only occurs when the symbol is followed directly by a word, with no space in between (see the examples below).

Use case:

I have an NSAttributedString that can contain images anywhere in the text. Each image is represented internally by the Object Replacement Character (U+FFFC). When a document starts with an image followed directly by a word rather than a space, tokenization fails.
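
A possible workaround for this use case, offered as an untested sketch rather than a confirmed fix: tokenize a copy of the text in which every U+FFFC is replaced by a space. It assumes the tokenizer behaves normally once whitespace stands where the attachment character was. The names `maskingAttachments(in:)` and `wordRanges(in:)` are hypothetical helpers, not part of any Apple API. U+FFFC and the space are each one UTF-16 code unit, so the resulting `NSRange` values should still line up with the original NSAttributedString.

```swift
import Foundation
#if canImport(NaturalLanguage)
import NaturalLanguage
#endif

/// Hypothetical helper: returns a copy of `input` in which every
/// Object Replacement Character (U+FFFC) is replaced by a space.
/// Both scalars occupy one UTF-16 code unit, so offsets are preserved.
func maskingAttachments(in input: String) -> String {
    let attachment: Unicode.Scalar = "\u{FFFC}"
    let space: Unicode.Scalar = " "
    var masked = String.UnicodeScalarView()
    for scalar in input.unicodeScalars {
        masked.append(scalar == attachment ? space : scalar)
    }
    return String(masked)
}

#if canImport(NaturalLanguage)
/// Hypothetical helper: tokenizes the masked copy by word and returns
/// UTF-16 ranges that can be applied to the original attributed string.
/// Assumption: the tokenizer handles the masked text correctly.
func wordRanges(in input: String) -> [NSRange] {
    let masked = maskingAttachments(in: input)
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = masked
    return tokenizer
        .tokens(for: masked.startIndex..<masked.endIndex)
        .map { NSRange($0, in: masked) }
}
#endif
```

One consequence of this design: an image sitting between two words with no surrounding spaces now acts as a word separator, which seems reasonable for attachments but is a behavior change worth checking.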

To reproduce:

/// Splits by natural language words.
static let tokenizeByWord: (String) -> [String] = { input in
    
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = input
    
    var tokens = [String]()
    
    tokenizer.enumerateTokens(in: input.startIndex..<input.endIndex) { tokenRange, _ in
        let token = input[tokenRange]
        tokens.append(String(token))
        return true
    }
    return tokens
}
// ❌ These all fail: (string starts with a symbol, followed directly by a word)
XCTAssertEqual(tokenizeByWord("\u{FFFC}hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("©hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("®hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("|hello world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("\\hello world"), ["hello", "world"])

// ✅ These all pass: (space after symbol)
XCTAssertEqual(tokenizeByWord("\u{FFFC} hello world"), ["\u{FFFC}", "hello", "world"])
XCTAssertEqual(tokenizeByWord("© hello world"), ["©", "hello", "world"])
XCTAssertEqual(tokenizeByWord("® hello world"), ["®", "hello", "world"])
XCTAssertEqual(tokenizeByWord("| hello world"), ["|", "hello", "world"])
XCTAssertEqual(tokenizeByWord("\\ hello world"), ["\\", "hello", "world"])

// ✅ These all pass: (no space, but symbol right before second word)
XCTAssertEqual(tokenizeByWord("hello \u{FFFC}world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ©world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello ®world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello |world"), ["hello", "world"])
XCTAssertEqual(tokenizeByWord("hello \\world"), ["hello", "world"])

// ✅ Emoji pass with and without space:
XCTAssertEqual(tokenizeByWord("hello world" ), ["", "hello", "world"])
XCTAssertEqual(tokenizeByWord(" hello world"), ["", "hello", "world"])

System:

  • macOS Catalina 10.15.7 (19H2)
  • Xcode 12.4 (12D4e)
Mark
  • What is `TokenizerStrategy` and what does `tokenize` do? – Willeke Apr 15 '21 at 09:41
  • It is simply a function wrapper that takes a String and returns a list of tokens. I updated the source code to make it more clear. – Mark Apr 15 '21 at 16:53
  • I tried your code and `tokenizeByWord(" world")` returns `["", "world"]`. – Willeke Apr 15 '21 at 22:20
  • You're right. Looks like I mixed up the sample code. Turns out emojis work fine, but it still fails for symbols in the Unicode "Other Symbols" category like this: `"©hello world"`. I updated the question, could you please take another look? The key seems to be that the string starts with the special symbol followed directly by a word, not a whitespace. – Mark Apr 16 '21 at 09:29
  • It looks like a bug to me. Try `" hello world"` and `"hello © world"`. Possible workaround: use `CFStringTokenizer` but avoid `CFStringTokenizerGoToTokenAtIndex`. – Willeke Apr 16 '21 at 14:53
  • Ok, thanks. I'll file a bug report. – Mark Apr 20 '21 at 09:07
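
The `CFStringTokenizer` workaround suggested in the comments could look roughly like the sketch below. It walks the tokens with `CFStringTokenizerAdvanceToNextToken` only and never calls `CFStringTokenizerGoToTokenAtIndex`. The function name `tokenizeByWordCF` is made up for illustration, and whether this path actually avoids the leading-symbol problem has not been verified here.

```swift
import Foundation

#if canImport(Darwin)
/// Hypothetical alternative to the NLTokenizer-based closure:
/// splits `input` into words using CFStringTokenizer.
func tokenizeByWordCF(_ input: String) -> [String] {
    let text = input as NSString
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text as CFString,
        CFRangeMake(0, text.length),                // whole string, UTF-16 units
        CFOptionFlags(kCFStringTokenizerUnitWord),  // tokenize by word
        nil                                         // no explicit locale
    )

    var tokens = [String]()
    // A raw value of 0 is kCFStringTokenizerTokenNone, i.e. no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        let nsRange = NSRange(location: range.location, length: range.length)
        tokens.append(text.substring(with: nsRange))
    }
    return tokens
}
#endif
```

It takes a String and returns a list of tokens just like `tokenizeByWord`, so the same XCTAssertEqual checks from the question can be pointed at it to see whether the results differ.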
