
I have a regular expression that is supposed to allow me to annotate pieces of code within markdown documents. Basically it looks for content between /*HLS*/ and /*HLE*/ comments, and wraps that in a span. It even allows for a small explanation that'll become the title of the span.

import Foundation

let content = """
extension ViewController: UITableViewDataSource {
  func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return /*HLS Explanation here!*/viewModel.books.value.count/*HLE*/
  }

  func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
    let book = /*HLS*/viewModel.books.value[indexPath.row]/*HLE*/
    let cell = tableView.dequeueReusableCell(withIdentifier: "BookCell") as! BookCell
    cell.configure(with: book)
    return cell
  }
}
"""

let regex = try NSRegularExpression(pattern: #"(?s)\/\*HLS\W?(.*?)\*\/(.*?)\/\*HLE\*\/"#)
let range = NSRange(content.startIndex..<content.endIndex, in: content)

let newContent = regex.stringByReplacingMatches(in: content, options: [], range: range, withTemplate: #"<span class="highlight" title="$1">$2</span>"#)
print(newContent)

The result:

extension ViewController: UITableViewDataSource {
  func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return <span class="highlight" title="Explanation here!">viewModel.books.value.count</span>
  }

  func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
    let book = <span class="highlight" title="">viewModel.books.value[indexPath.row]</span>
    let cell = tableView.dequeueReusableCell(withIdentifier: "BookCell") as! BookCell
    cell.configure(with: book)
    return cell
  }
}

This is exactly how it is supposed to work.

However, when I remove the "Explanation here!" text from the first comment, the regex becomes too greedy.

import Foundation

let content = """
extension ViewController: UITableViewDataSource {
  func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return /*HLS*/viewModel.books.value.count/*HLE*/
  }

  func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
    let book = /*HLS*/viewModel.books.value[indexPath.row]/*HLE*/
    let cell = tableView.dequeueReusableCell(withIdentifier: "BookCell") as! BookCell
    cell.configure(with: book)
    return cell
  }
}
"""

let regex = try NSRegularExpression(pattern: #"(?s)\/\*HLS\W?(.*?)\*\/(.*?)\/\*HLE\*\/"#)
let range = NSRange(content.startIndex..<content.endIndex, in: content)

let newContent = regex.stringByReplacingMatches(in: content, options: [], range: range, withTemplate: #"<span class="highlight" title="$1">$2</span>"#)
print(newContent)

Result:

extension ViewController: UITableViewDataSource {
  func tableView(_ tableView: UITableView, numberOfRowsInSection section: Int) -> Int {
    return <span class="highlight" title="/viewModel.books.value.count/*HLE">
  }

  func tableView(_ tableView: UITableView, cellForRowAt indexPath: IndexPath) -> UITableViewCell {
    let book = /*HLS*/viewModel.books.value[indexPath.row]</span>
    let cell = tableView.dequeueReusableCell(withIdentifier: "BookCell") as! BookCell
    cell.configure(with: book)
    return cell
  }
}

As you can see, /viewModel.books.value.count/*HLE becomes the title, and then everything up to the second /*HLE*/ gets wrapped. The title capture group should stop at the very first */ it encounters, but it doesn't: it runs on to the second one. Why is that? The regex should match (.*?) only until it encounters \*\/, right?

When I remove the (?s) flag everything works as expected again, but I want to be able to wrap multiple lines between /*HLS*/ and /*HLE*/.

Kevin Renskers

1 Answer

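The problem is the \W? "non-word" part of the pattern: it optionally matches any char other than a word char (letters, digits, underscore, and some others such as diacritics, connector punctuation and zero-width joiners). When the comment is just /*HLS*/, \W? greedily consumes the * of the closing */, so the following (.*?) has to keep expanding until the next */, which is the one that ends /*HLE*/. Without (?s), . cannot cross a line break, so that attempt fails and the engine backtracks to let \W? match nothing, which is why the pattern appears to work then.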
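To make the effect of \W? visible, here is a minimal sketch that runs the question's original pattern against a shortened, single-line input and prints what ends up in capture group 1. The `sample` string and the variable names are made up for illustration.

```swift
import Foundation

// Shortened single-line input with empty /*HLS*/ comments (illustrative only).
let sample = "return /*HLS*/count/*HLE*/ + /*HLS*/other/*HLE*/"

// The original pattern from the question, unchanged.
let original = try! NSRegularExpression(
    pattern: #"(?s)\/\*HLS\W?(.*?)\*\/(.*?)\/\*HLE\*\/"#)
let fullRange = NSRange(sample.startIndex..<sample.endIndex, in: sample)

// Extract capture group 1 (the would-be title) of the first match.
var firstTitle = ""
if let match = original.firstMatch(in: sample, options: [], range: fullRange),
   let titleRange = Range(match.range(at: 1), in: sample) {
    firstTitle = String(sample[titleRange])
}

// \W? has eaten the "*" of "*/", so the title runs to the next "*/".
print(firstTitle) // prints "/count/*HLE"
```

The leading / in the captured title is the leftover half of the */ whose * was taken by \W?.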

There are a couple of solutions, but you probably just want \W? to match a non-word char immediately after HLS unless that char starts a */ sequence. Thus, you can use this immediate fix:

(?s)/\*HLS(?:(?!\*/)\W)?(.*?)\*/(.*?)/\*HLE\*/

See the regex demo. The (?:(?!\*/)\W)? part is an optional (the ? at the end) non-capturing group ((?:...)) that matches one or zero occurrences of a non-word char, provided that char is not a * immediately followed by /.
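As a rough sketch of how the immediate fix could be dropped into the question's code, the snippet below wraps the replacement in a helper. The function name `highlight` and the sample input are assumptions made for this example; the pattern and template are the ones shown above.

```swift
import Foundation

/// Wraps /*HLS title*/ ... /*HLE*/ regions in a span, using the fixed pattern.
/// `highlight` is a hypothetical helper name, not part of any library.
func highlight(_ source: String) -> String {
    let pattern = #"(?s)/\*HLS(?:(?!\*/)\W)?(.*?)\*/(.*?)/\*HLE\*/"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return source // leave the input untouched if the pattern fails to compile
    }
    let range = NSRange(source.startIndex..<source.endIndex, in: source)
    return regex.stringByReplacingMatches(
        in: source, options: [], range: range,
        withTemplate: #"<span class="highlight" title="$1">$2</span>"#)
}

// One marker without a title, one with a title.
let input = "return /*HLS*/a.count/*HLE*/\nlet b = /*HLS Note*/a[0]/*HLE*/"
print(highlight(input))
// return <span class="highlight" title="">a.count</span>
// let b = <span class="highlight" title="Note">a[0]</span>
```

Because (?s) is still in place, a region that spans several lines between /*HLS*/ and /*HLE*/ is wrapped as a whole.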

Note that you do not need to escape forward slashes: they are not special regex metacharacters. In Swift, a pattern passed to NSRegularExpression is a plain string literal, not a regex literal in /.../ notation (where / are the regex delimiters), so there is nothing to protect them from.
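A quick way to convince yourself of this is to compare the two spellings directly. The sketch below uses a hypothetical helper, `firstMatchRange`, and a made-up input string.

```swift
import Foundation

/// Returns the range of the first match of `pattern` in `text`, if any.
/// Hypothetical helper written for this comparison only.
func firstMatchRange(of pattern: String, in text: String) -> NSRange? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let range = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.firstMatch(in: text, options: [], range: range)?.range
}

let text = "x = /*HLE*/"
let escaped = firstMatchRange(of: #"\/\*HLE\*\/"#, in: text)   // slashes escaped
let unescaped = firstMatchRange(of: #"/\*HLE\*/"#, in: text)   // slashes left bare

// Both spellings find the same 7-character comment starting at offset 4.
print(escaped == unescaped) // prints "true"
```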

If you want to make the pattern safer (exclude matches on "broken" HLS/HLE), you can use a solution like

(?s)/\*HLS(?:(?!\*/)\W)?((?:(?!/\*HLS).)*?)\*/(.*?)/\*HLE\*/

See this regex demo where I added /*HLS into a string literal. The (?:(?!/\*HLS).)*? part matches any char, zero or more times but as few times as possible, as long as that char does not start a /*HLS sequence.
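To illustrate the difference, here is a small sketch with a stray /*HLS inside a string literal followed by a real marker pair. The helper name `safeHighlight` and the input line are invented for this example.

```swift
import Foundation

/// Same replacement as before, but the title group may not run across
/// another "/*HLS". `safeHighlight` is a hypothetical helper name.
func safeHighlight(_ source: String) -> String {
    let pattern = #"(?s)/\*HLS(?:(?!\*/)\W)?((?:(?!/\*HLS).)*?)\*/(.*?)/\*HLE\*/"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return source
    }
    let range = NSRange(source.startIndex..<source.endIndex, in: source)
    return regex.stringByReplacingMatches(
        in: source, options: [], range: range,
        withTemplate: #"<span class="highlight" title="$1">$2</span>"#)
}

// A lone "/*HLS" in a string literal, then a well-formed marker pair.
let broken = #"let s = "/*HLS"; x = /*HLS t*/v/*HLE*/"#
print(safeHighlight(broken))
// let s = "/*HLS"; x = <span class="highlight" title="t">v</span>
```

With the immediate fix alone, the match would start at the /*HLS inside the string literal and the title would swallow everything up to the first */.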

Note this whole regex won't work correctly if you have a match inside a string literal.

Wiktor Stribiżew