let regex1 = "(\\ud83d\\udc68)"
let regex2 = "(\\ud83d[\\udc68-\\udc69])"

"👨".capturedGroupsFull(forRegex: regex1)
// returns 1 match: [(.0 "👨", .1 {0, 2})]
"👨".capturedGroupsFull(forRegex: regex2)
// returns nil

Why is the first line returning one match and the second line no match?

  • Both regular expressions work fine on regex101 (e.g. set the flavor to JavaScript and enter the second regex as (\ud83d[\udc68-\udc69])).
  • I am working with Swift 4.0.
  • The regex "(\\ud83d[\\udc68])" also returns nil when tested in a Playground.

Below you can find the full code I use to retrieve the matches.

extension String {
    func capturedGroupsFull(forRegex regex: String) -> [(String, NSRange)]? {
        let expression: NSRegularExpression
        do {
            expression = try NSRegularExpression(pattern: regex, options: [.caseInsensitive])
        } catch {
            return nil
        }
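        // NSRange is measured in UTF-16 code units, so use the NSString length.
        let nsString = self as NSString
        let matches = expression.matches(in: self, options: [], range: NSRange(location: 0, length: nsString.length))
        // Return nil when nothing matched, mirroring the invalid-pattern case.
        guard !matches.isEmpty else { return nil }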
        var results = [(String, NSRange)]()
        for match in matches {
            let range = match.range
            let matchedString = nsString.substring(with: range)
            results.append((matchedString, range))
        }
        return results
    }
}
christopher.online

1 Answer


Why is the first line returning one match and the second line no match?

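As already noted in the comments, NSRegularExpression works on Unicode code points, whereas JavaScript regex works on UTF-16 code units by default.

Some patterns like "\\ud83d\\udc68", which consist of a valid surrogate pair, may be folded into the single Unicode code point U+1F468, but this behavior is not well documented, so you should not rely on it, as you found with "(\\ud83d[\\udc68])". There the low surrogate sits inside a character class, so it apparently cannot be combined with the preceding \ud83d, and a lone surrogate does not match any part of the code point U+1F468.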
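To make the code point versus code unit distinction concrete, here is a small sketch that is not part of the original question. It inspects the same character both ways, using only the standard library; the variable names are chosen for illustration.

```swift
// U+1F468 written as a Swift Unicode scalar escape.
let man = "\u{1F468}"

// One Unicode code point (scalar)...
let scalarCount = man.unicodeScalars.count

// ...stored as two UTF-16 code units, the surrogate pair from the question.
let utf16Units = man.utf16.map { String($0, radix: 16) }

print(scalarCount)  // 1
print(utf16Units)   // ["d83d", "dc68"]
```

The two UTF-16 code units are also why the reported match range is {0, 2}: NSRange counts UTF-16 code units, even though the regex engine matches whole code points.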


I recommend not writing surrogate pairs with \uhhhh; use \UHHHHHHHH (or \x{hhhh}) for non-BMP characters instead.

let regex1 = "(\\U0001F468)" // or "(\\x{1F468})"
let regex2 = "([\\U0001F468-\\U0001F469])" // or "([\\x{1F468}-\\x{1F469}])"

"👨".capturedGroupsFull(forRegex: regex1)
// -> [(.0 "👨", .1 {0, 2})]
"👨".capturedGroupsFull(forRegex: regex2)
// -> [(.0 "👨", .1 {0, 2})]

Recent JavaScript regex engines accept the u flag, which makes them work on Unicode code points. Try these:

/(\u{1F468})/u
/([\u{1F468}-\u{1F469}])/u

You can easily test your pattern with JavaScript syntax and then convert it to NSRegularExpression syntax by replacing \u with \x (and removing the surrounding / and /u).
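The conversion described above could be sketched as follows. The helper name nsPattern(fromJavaScriptLiteral:) is hypothetical, and the function does plain text substitution only: it assumes the literal carries exactly the u flag and uses the \u{...} escape form.

```swift
import Foundation

// Hypothetical helper (not from the original answer): turns a JavaScript
// regex literal such as /(\u{1F468})/u into an NSRegularExpression pattern.
func nsPattern(fromJavaScriptLiteral literal: String) -> String {
    var body = literal
    // Drop the leading "/" and the trailing "/u" of the literal.
    if body.hasPrefix("/") { body.removeFirst() }
    if body.hasSuffix("/u") { body.removeLast(2) }
    // Rewrite every \u{...} escape as \x{...}.
    return body.replacingOccurrences(of: "\\u{", with: "\\x{")
}

let converted = nsPattern(fromJavaScriptLiteral: "/([\\u{1F468}-\\u{1F469}])/u")
print(converted)  // ([\x{1F468}-\x{1F469}])
```

The result can be passed as the forRegex argument of capturedGroupsFull. Literals with other flags, or an escaped backslash directly before u{, would need extra handling.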

OOPer