Strange Behavior In CharacterSet.contains() Method, With High UTF8 Characters Mixed With ASCII

Question

Here's the deal: I am creating a StringProtocol extension that adds the ability to split a string on a character set (any character in the set acts as a separator, and the comparison is greedy).

The issue is that I am having difficulties comparing against a CharacterSet that has BOTH low-number ASCII characters AND high-number UTF8 characters.

If the set contains only high UTF-8 characters or only ASCII characters, the match works fine.

I created a playground that illustrates this.

The strange result is the second-to-last printout ("Test String 2 does not have a space or a joker."). That should say "does."

The issue is that the space in the CharacterSet matches, but the joker card does not.

Any ideas? Here's the playground:

import Foundation

public extension StringProtocol {
    func containsOneOfThese(_ inCharacterset: CharacterSet) -> Bool {
        self.contains { (char) in
            char.unicodeScalars.contains { (scalar) in inCharacterset.contains(scalar) }
        }
    }
}

let space = " "
let joker = "🃏" // U+1F0CF, outside the BMP
let both = space + joker

let spadesNumberCards = ""
let spadesFaceCards = ""

let testString1 = spadesNumberCards + space + spadesFaceCards
let testString2 = spadesNumberCards + joker + spadesFaceCards
let testString3 = spadesNumberCards + both + spadesFaceCards

print("These Are The Strings We Are Testing:\n")
print("Test String 1: \"\(testString1)\"")
print("Test String 2: \"\(testString2)\"")
print("Test String 3: \"\(testString3)\"")
      
print("\nFirst, See If Any Of the Strings Contain Spaces:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: space)) ? "" : "not ")have a space.")

print("\nNext, See If Any Of the Strings Contain Jokers:\n")
print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: joker)) ? "" : "not ")have a joker.")

print("\nOK, Now it gets weird:\n")

print("Test String 1 does \(testString1.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 2 does \(testString2.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")
print("Test String 3 does \(testString3.containsOneOfThese(CharacterSet(charactersIn: both)) ? "" : "not ")have a space or a joker.")

Which prints out:

These Are The Strings We Are Testing:

Test String 1: " "
Test String 2: "🃏"
Test String 3: " 🃏"

First, See If Any Of the Strings Contain Spaces:

Test String 1 does have a space.
Test String 2 does not have a space.
Test String 3 does have a space.

Next, See If Any Of the Strings Contain Jokers:

Test String 1 does not have a joker.
Test String 2 does have a joker.
Test String 3 does have a joker.

OK, Now it gets weird:

Test String 1 does have a space or a joker.
Test String 2 does not have a space or a joker.
Test String 3 does have a space or a joker.

I have observed before that `CharacterSet` does not work well with characters outside of the BMP (basic multilingual plane), perhaps due to its heritage from `NSCharacterSet`. You might be better off using a `Set` instead. — Martin R, Jul 09 '20 at 19:58
Thanks! Let me give that a try. The whole deal is to make it easy to use, but turning a simple String into a Set should be ok. — Chris Marshall, Jul 09 '20 at 20:04
Actually it seems that `CharacterSet(charactersIn: both)` is broken. With `CharacterSet(both.unicodeScalars)` you get the expected result. — Martin R, Jul 09 '20 at 20:07
Oohhh...so the converter is broken. I may still look at using a Set, or even a simple brute-force comparison (This is not meant for industrial use). But your suggestion is cool. — Chris Marshall, Jul 09 '20 at 20:10
Yup. You're right. If you phrase that as an answer, I'll greencheck you. Thanks! — Chris Marshall, Jul 09 '20 at 20:12

Martin R · Accepted Answer · 2020-07-10T07:38:37.947

It seems that CharacterSet.init(charactersIn string: String) does not work correctly if the string contains characters from both inside and outside the BMP (basic multilingual plane):

let s = " 🃏" // a space (inside the BMP) followed by U+1F0CF (outside the BMP)
let cs = CharacterSet(charactersIn: s)
s.unicodeScalars.forEach {
    print(cs.contains($0))
}

// Expected output: true, true
// Actual output:   true, false

A workaround is to create the character set from the sequence of Unicode scalars instead:

let cs = CharacterSet(s.unicodeScalars)

This will produce the expected output.
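Applied to a helper like the one in the question, the workaround might look like the sketch below. The name `containsAnyScalar(of:)` is hypothetical and not part of Foundation; it takes the candidate characters as a plain `String` and builds the set from their Unicode scalars, so `CharacterSet(charactersIn:)` is never called.

```swift
import Foundation

extension StringProtocol {
    /// Hypothetical helper (not a Foundation API): returns true if any
    /// Unicode scalar of `self` also occurs in `candidates`.
    func containsAnyScalar(of candidates: String) -> Bool {
        // Build the set from the scalar view instead of using
        // CharacterSet(charactersIn:), which is the initializer that misbehaves.
        let set = CharacterSet(candidates.unicodeScalars)
        return unicodeScalars.contains { set.contains($0) }
    }
}

let jokerScalar = "\u{1F0CF}"      // playing-card joker, outside the BMP
let spaceAndJoker = " " + jokerScalar

let hasJoker = "ab\(jokerScalar)cd".containsAnyScalar(of: spaceAndJoker)
let hasSpace = "ab cd".containsAnyScalar(of: spaceAndJoker)
let hasNeither = "abcd".containsAnyScalar(of: spaceAndJoker)
print(hasJoker, hasSpace, hasNeither)
```

Like `CharacterSet` itself, this still compares individual Unicode scalars, not whole `Character` values.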

But note that this cannot handle the full range of Swift Characters (which include grapheme clusters consisting of multiple Unicode scalars). Therefore you might want to work with a Set<Character> instead.
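A minimal sketch of the `Set<Character>` approach follows, assuming the goal is the greedy split described in the question. The method names `containsAny(of:)` and `split(onAnyOf:)` are hypothetical, chosen only for illustration. Because the set holds `Character` values, a separator can be a grapheme cluster made of several Unicode scalars, such as a flag.

```swift
extension StringProtocol {
    /// Hypothetical helper: true if any Character of `self` is in `characters`.
    func containsAny(of characters: Set<Character>) -> Bool {
        return contains { characters.contains($0) }
    }

    /// Hypothetical helper: splits on any Character in `characters`.
    /// Runs of adjacent separators collapse, because split(whereSeparator:)
    /// omits empty subsequences by default.
    func split(onAnyOf characters: Set<Character>) -> [SubSequence] {
        return split(whereSeparator: { characters.contains($0) })
    }
}

// A flag is a single Character built from two regional-indicator scalars.
let flag: Character = "\u{1F1EF}\u{1F1F5}"
let separators: Set<Character> = [" ", "\u{1F0CF}", flag]

let pieces = "a \u{1F0CF}b\(flag)c".split(onAnyOf: separators).map { String($0) }
let found = "x\(flag)y".containsAny(of: separators)
print(pieces, found)
```

The trade-off is that a `Set<Character>` has to list its members explicitly, whereas `CharacterSet` can also describe predefined classes such as `.whitespaces`.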

edited Jul 10 '20 at 07:38

answered Jul 09 '20 at 20:16

Martin R

Woo-Hoo! Thanks for the fast turnaround! – Chris Marshall Jul 09 '20 at 20:19
@MartinR Should `CharacterSet` have been called `UnicodeScalarSet`? From what I can tell, the "Character" in "CharacterSet" doesn't mean "Character" as in `Swift.Character`. – Alexander Jul 09 '20 at 20:28
Good point. @MartinR 's point about the grapheme clusters is worth thinking about, as well. I may not do that for this utility (yet), but it's an issue. – Chris Marshall Jul 09 '20 at 20:30
@Alexander-ReinstateMonica: It is named CharacterSet because it is the overlay value type (in the sense of [SE-0069](https://github.com/apple/swift-evolution/blob/master/proposals/0069-swift-mutability-for-foundation.md)) of NSCharacterSet. Renaming it has been [discussed and rejected](https://forums.swift.org/t/pitch-renaming-characterset-to-unicodescalarset/4135). – Martin R Jul 09 '20 at 20:35
@MartinR Oh interesting, I called it! :p It's a shame, because it really has some "surprises" and broken expectations. Rationale: "We made a decision to leave the names of the types the same between Swift and Foundation. It’s a tradeoff for sure, but it seems better than other alternatives. Consistent documentation and hindering a common understanding of purpose for the type would be the biggest challenge if we change the names." – Alexander Jul 09 '20 at 20:37