18

I'm looking for a way, in Swift 4, to test if a Character is a member of an arbitrary CharacterSet. I have this Scanner class that will be used for some lightweight parsing. One of the functions in the class is to skip any characters, at the current position, that belong to a certain set of possible characters.

class MyScanner {
  let str: String
  var idx: String.Index
  init(_ string: String) {
    str = string
    idx = str.startIndex
  }
  var remains: String { return String(str[idx..<str.endIndex])}

  func skip(charactersIn characters: CharacterSet) {
    while idx < str.endIndex && characters.contains(str[idx]) {
      idx = str.index(idx, offsetBy: 1)
    }
  }
}

let scanner = MyScanner("fizz   buzz fizz")
scanner.skip(charactersIn: CharacterSet.alphanumerics)
scanner.skip(charactersIn: CharacterSet.whitespaces)
print("what remains: \"\(scanner.remains)\"")

I would like to implement the skip(charactersIn:) function so that the above code would print buzz fizz.

The tricky part is characters.contains(str[idx]) in the while condition: .contains() requires a Unicode.Scalar, but str[idx] is a Character, and I'm at a loss trying to figure out the next step.

I know I could pass in a String to the skip function, but I'd like to find a way to make it work with a CharacterSet, because of all the convenient static members (alphanumerics, whitespaces, etc.).

How does one test a CharacterSet if it contains a Character?

Steven Grosmark
  • There is a system class called `NSScanner`, bridged into Swift as `Scanner`. Have you checked it out? – Code Different Aug 25 '17 at 00:44
  • NSScanner sure does look like a wheel I'm re-inventing at first blush. Not crazy about the NS semantics (uses an inout `NSString?` param), but it might do the trick. Out of curiosity, I looked through the [source](https://github.com/apple/swift-corelibs-foundation/blob/master/Foundation/Scanner.swift), and it converts a `String` to an `Array`, and its `skip` function then just uses `set.contains(UnicodeScalar(currentCharacter)!)`. – Steven Grosmark Aug 25 '17 at 01:57
  • If you don't like the NS semantics of `NSScanner`, then use Foundation's `Scanner`, which doesn't use NS types. Certainly don't define your own class with the name of an existing class. That's just going to be confusing. – Rob Aug 25 '17 at 02:01
  • Good point re: class name (edited question). Foundation's Scanner still requires an NSString as a receiving inout parameter: `func scanCharacters(from set: CharacterSet, into result: AutoreleasingUnsafeMutablePointer<NSString?>?) -> Bool` – Steven Grosmark Aug 25 '17 at 02:16
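
For readers on newer toolchains, the Foundation route discussed in these comments might look like the sketch below. It assumes a Foundation version where `Scanner` offers the `String`-returning `scanCharacters(from:)` and the `currentIndex` property (on Apple platforms these require macOS 10.15 / iOS 13 or later); neither was available in Swift 4, so this is not what the commenters had at hand.

```swift
import Foundation

// Assumption: a Foundation version that provides Scanner.scanCharacters(from:)
// returning String? and Scanner.currentIndex (not available in Swift 4).
let foundationScanner = Scanner(string: "fizz   buzz fizz")

// By default Scanner silently skips whitespace before each scan;
// disable that so every skip is explicit.
foundationScanner.charactersToBeSkipped = nil

// The scanned text is discarded: we only want to advance the position.
_ = foundationScanner.scanCharacters(from: .alphanumerics)
_ = foundationScanner.scanCharacters(from: .whitespaces)

// Everything from the current position to the end of the string.
let foundationRemains = String(foundationScanner.string[foundationScanner.currentIndex...])
print("what remains: \"\(foundationRemains)\"")
```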

3 Answers

15

Not sure if it's the most efficient way, but you can create a new CharacterSet from the character and check whether it is a subset of the target set (set comparison is rather quick):

let newSet = CharacterSet(charactersIn: "a")
// let newSet = CharacterSet(charactersIn: "\(character)")
print(newSet.isSubset(of: CharacterSet.decimalDigits)) // false
print(newSet.isSubset(of: CharacterSet.alphanumerics)) // true
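
The subset check above could be wrapped in a small helper so it can be called with a `Character` directly. This is only a sketch: `isSuperset(ofScalarsIn:)` is a hypothetical name, not part of Foundation, and it builds a temporary `CharacterSet` on every call.

```swift
import Foundation

extension CharacterSet {
    /// Hypothetical helper (not a Foundation API): true when every Unicode
    /// scalar of `character` is a member of this set.
    func isSuperset(ofScalarsIn character: Character) -> Bool {
        // Build a one-character set and compare it against self.
        let single = CharacterSet(charactersIn: String(character))
        return single.isSubset(of: self)
    }
}

let digitInDigits = CharacterSet.decimalDigits.isSuperset(ofScalarsIn: "3")
let letterInDigits = CharacterSet.decimalDigits.isSuperset(ofScalarsIn: "a")
let letterInAlphanumerics = CharacterSet.alphanumerics.isSuperset(ofScalarsIn: "a")
print(digitInDigits, letterInDigits, letterInAlphanumerics)
```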
nathan
10

A Swift 4.2 CharacterSet extension function that checks whether the set contains every Unicode scalar of a Character (allSatisfy requires Swift 4.2):

extension CharacterSet {
    func containsUnicodeScalars(of character: Character) -> Bool {
        return character.unicodeScalars.allSatisfy(contains(_:))
    }
}

Usage example:

CharacterSet.decimalDigits.containsUnicodeScalars(of: "3") // true
CharacterSet.decimalDigits.containsUnicodeScalars(of: "a") // false
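
To connect this back to the question, here is a sketch of how the extension might be used inside the scanner's skip method. `SkippingScanner` is a hypothetical stand-in for the question's class, and the extension is restated (with a closure instead of a method reference) only so that the block is self-contained.

```swift
import Foundation

extension CharacterSet {
    // Restated from the answer above so this sketch compiles on its own.
    func containsUnicodeScalars(of character: Character) -> Bool {
        return character.unicodeScalars.allSatisfy { contains($0) }
    }
}

// Hypothetical scanner type modeled on the question's MyScanner.
final class SkippingScanner {
    let string: String
    var index: String.Index

    init(_ string: String) {
        self.string = string
        index = string.startIndex
    }

    var remains: String { return String(string[index...]) }

    // Advance past every character whose scalars all belong to `set`.
    func skip(charactersIn set: CharacterSet) {
        while index < string.endIndex,
              set.containsUnicodeScalars(of: string[index]) {
            index = string.index(after: index)
        }
    }
}

let demoScanner = SkippingScanner("fizz   buzz fizz")
demoScanner.skip(charactersIn: .alphanumerics)
demoScanner.skip(charactersIn: .whitespaces)
print("what remains: \"\(demoScanner.remains)\"")
```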
Vadim Ahmerov
8

I know that you wanted to use CharacterSet rather than String, but CharacterSet does not (yet, at least) support characters that are composed of more than one Unicode.Scalar. See the "family" emoji (several person emoji joined by zero-width joiners, U+200D) or the international flag characters (e.g. "🇯🇵" or "🇯🇲") that Apple demonstrated in the string discussion in the WWDC 2017 video What's New in Swift. The multiple skin tone emoji also manifest this behavior (a base emoji followed by a skin tone modifier scalar, versus the base emoji alone).

As a result, I'd be wary of using CharacterSet (which is a "set of Unicode character values for use in search operations"). Or, if you want to provide this method for the sake of convenience, be aware that it will not work correctly with characters represented by multiple Unicode scalars.
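
The multi-scalar issue can be seen with the standard library alone. The sketch below builds two flag characters from their regional-indicator scalars (written as escapes) and shows that each is one `Character` made of two scalars, and that the two flags share their first scalar, which is why a first-scalar-only test would confuse them. The variable names are illustrative.

```swift
// Flags are encoded as pairs of regional-indicator scalars.
let japanFlag: Character = "\u{1F1EF}\u{1F1F5}"    // regional indicators J + P
let jamaicaFlag: Character = "\u{1F1EF}\u{1F1F2}"  // regional indicators J + M

let japanScalars = Array(japanFlag.unicodeScalars)
let jamaicaScalars = Array(jamaicaFlag.unicodeScalars)

// Each flag is a single Character composed of two Unicode scalars.
print(japanScalars.count, jamaicaScalars.count)

// The first scalar is identical, so testing only the first scalar
// against a set would treat the two flags as the same thing.
let sharesFirstScalar = japanScalars[0] == jamaicaScalars[0]

// Compared as whole Characters they are different.
let sameCharacter = japanFlag == jamaicaFlag
print(sharesFirstScalar, sameCharacter)
```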

So, you might offer a scanner that provides both CharacterSet and String renditions of the skip method:

class MyScanner {
    let string: String
    var index: String.Index

    init(_ string: String) {
        self.string = string
        index = string.startIndex
    }

    var remains: String { return String(string[index...]) }

    /// Skip characters in a string
    ///
    /// This rendition is safe to use with strings that have characters
    /// represented by more than one unicode scalar.
    ///
    /// - Parameter skipString: A string with all of the characters to skip.

    func skip(charactersIn skipString: String) {
        while index < string.endIndex, skipString.contains(string[index]) {
            index = string.index(index, offsetBy: 1)
        }
    }

    /// Skip characters in character set
    ///
    /// Note, character sets cannot (yet) include characters that are represented by
    /// more than one unicode scalar (e.g. a family emoji or a flag such as 🇯🇵 or 🇯🇲).
    /// If you want to test for these multi-unicode characters, you have to use the
    /// `String` rendition of this method.
    ///
    /// This will simply stop scanning if it encounters a multi-unicode character in
    /// the string being scanned (because it knows the `CharacterSet` can only represent
    /// single-unicode characters) and you want to avoid false positives (e.g., mistaking
    /// the Jamaican flag, 🇯🇲, for the Japanese flag, 🇯🇵).
    ///
    /// - Parameter characterSet: The character set to check for membership.

    func skip(charactersIn characterSet: CharacterSet) {
        while index < string.endIndex,
            string[index].unicodeScalars.count == 1,
            let character = string[index].unicodeScalars.first,
            characterSet.contains(character) {
                index = string.index(index, offsetBy: 1)
        }
    }

}

Thus, your simple example will still work:

let scanner = MyScanner("fizz   buzz fizz")
scanner.skip(charactersIn: CharacterSet.alphanumerics)
scanner.skip(charactersIn: CharacterSet.whitespaces)
print(scanner.remains)  // "buzz fizz"

But use the String rendition if the characters you want to skip might include multiple unicode scalars:

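// A family emoji: four person scalars joined by zero-width joiners (U+200D)
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
let boy = "\u{1F466}"  // the boy emoji, a single scalar

let charactersToSkip = family + boy

let string = boy + family + "foobar"  // boy, then family, then "foobar"

let scanner = MyScanner(string)
scanner.skip(charactersIn: charactersToSkip)
print(scanner.remains)                // foobar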

As Michael Waterfall noted in the comments below, CharacterSet has a bug and doesn’t even handle 32-bit Unicode.Scalar values correctly, meaning that it doesn’t even handle single scalar characters properly if the value exceeds 0xffff (including emoji, amongst others). The String rendition, above, handles these correctly, though.
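
To see which characters fall into the affected range, and what a 16-bit truncation of the kind described in the comments would produce, plain arithmetic on the scalar values is enough. This is only an illustration: it does not call `CharacterSet` at all, `hasScalarBeyond16Bits` is a hypothetical helper, and whether a given Foundation version still shows the bug is not something the sketch checks.

```swift
// U+1F606 is the emoji scalar with decimal value 128518.
let emojiScalar = Unicode.Scalar(UInt32(128518))!

// Keeping only the low 16 bits yields a different, unrelated scalar value.
let truncatedValue = emojiScalar.value & 0xFFFF
print(truncatedValue)

/// Hypothetical helper: true when any scalar of the character
/// does not fit into 16 bits.
func hasScalarBeyond16Bits(_ character: Character) -> Bool {
    return character.unicodeScalars.contains { $0.value > 0xFFFF }
}

let emojiAffected = hasScalarBeyond16Bits(Character(emojiScalar))
let letterAffected = hasScalarBeyond16Bits("A")
print(emojiAffected, letterAffected)
```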

Rob
  • Interestingly, `CharacterSet` doesn't even handle an emoji that is represented by a single Unicode scalar (😆 = 128518): this returns `false`: `CharacterSet(charactersIn: "😆ABC").contains(UnicodeScalar(128518)!)` – Michael Waterfall Sep 22 '17 at 20:39
  • Yeah, above and beyond the single scalar limitation of character sets, there's obviously a bug in `CharacterSet`'s handling of 32-bit scalars, handling them like 16-bit scalars. E.g. try looking in your string for `Unicode.Scalar(62982)` (i.e. `128518 & 0xffff`); lol. It all works fine and dandy with 16-bit scalars, but it's a train wreck when you try using 32-bit scalars with values exceeding `UInt16.max`. We should file a bug report. I'm happy to do so, unless you'd rather do it. – Rob Sep 23 '17 at 19:54
  • Instead of `index = string.index(index, offsetBy: 1)`, there is a "mutating" method for this: `string.formIndex(after: &index)` – Leo Dabus Feb 03 '22 at 14:01
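
The `formIndex(after:)` suggestion in the last comment could be applied to a skip loop roughly as follows. This is a minimal sketch with illustrative names (`sampleText`, `cursor`); it advances the index in place instead of assigning a newly computed one.

```swift
let sampleText = "fizz buzz"
var cursor = sampleText.startIndex

// Advance until the first space, updating the index in place.
while cursor < sampleText.endIndex, sampleText[cursor] != " " {
    sampleText.formIndex(after: &cursor)
}

// Everything from the first space onward.
let restOfText = String(sampleText[cursor...])
print(restOfText)
```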