I want to run a regex over an HTML string that contains multiple anchor tags and build a dictionary mapping each link's text to its href URL.

<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>. This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.

How do I extract each <a> tag's text and href in one go?

Edit:

func extractLinks(html: String) -> Dictionary<String, String>? {

    do {
        let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: [])
        let nsString = html as NSString
        let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length))
        return results.map { nsString.substringWithRange($0.range)}
    } catch let error as NSError {
        print("invalid regex: \(error.localizedDescription)")
        return nil
    }
}
Nagendra Rao
  • Where is your regular expression code? – matt May 05 '17 at 23:11
  • @matt: They are waiting for you to write it. – l'L'l May 05 '17 at 23:13
  • It's pretty bad. – Nagendra Rao May 05 '17 at 23:14
  • Well, posting something is better than nothing... – l'L'l May 05 '17 at 23:15
  • Added my regex, which clearly isn't working. – Nagendra Rao May 05 '17 at 23:22
  • I come from a PHP background and am trying to figure out Swift style, reading the docs as we speak. – Nagendra Rao May 05 '17 at 23:26
  • Ok, try that regex out in php. Mainly, it uses look ahead/behind assertion and atomic group. But, it works great. –  May 05 '17 at 23:27
  • Not a regex answer but it’s generally considered better to use a proper HTML parser rather than a regex for this kind of work. Take a look at [SwiftSoup](https://github.com/scinfu/SwiftSoup) or [Kanna](https://github.com/tid-kijyun/Kanna) for two of many excellent parsing libraries. –  May 06 '17 at 00:02
  • `/<([a-z]*)\b[^>]*>(.*?)</\1>/i` is not a pattern in Swift / Objective-C regex. You need to lose the `/` delimiters and the `i`; this is not Perl! – matt May 06 '17 at 00:25
  • Apart from that, it's great. It captures the `...` and the link text. It makes _no_ attempt to capture the `href` content if that's what you were hoping for, however. Of course it would also capture _any_ html such as `this` so perhaps it is too broad for your purposes. – matt May 06 '17 at 00:33
  • I recommend you download [NSRegexTester](https://github.com/aaronvegh/nsregextester) and work up your pattern. Then you'll be ready to rock and roll. – matt May 06 '17 at 00:34
  • I'd suggest that you consider using an HTML parser rather than trying to parse with regex. See http://stackoverflow.com/a/1732454/1271826. Consider [TFHpple](https://github.com/topfunky/hpple) or [NDHpple](https://github.com/ndavon/NDHpple). The former is written in Objective-C, but can be used perfectly well from Swift. The latter is written in Swift, but isn't as mature. See Wenderlich's [How to parse HTML in iOS](https://www.raywenderlich.com/14172/how-to-parse-html-on-ios). It's in Objective-C, but the concepts outlined are perfectly applicable to Swift. – Rob May 06 '17 at 00:48

1 Answer

First of all, you need to learn the basic pattern syntax of NSRegularExpression:

  • The pattern does not contain delimiters.
  • The pattern does not contain modifiers; you pass that information as options instead.
  • To use the regex meta-character \, you need to write it as \\ inside a Swift string literal.

So, the line creating an instance of NSRegularExpression should be something like this:

let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive)

But, as you may already know, your pattern does not contain any code to match href or capture its value.

Something like this would work with your example html:

import Foundation

// Group 1 captures the quoted href value; group 2 captures the text between <a ...> and </a>.
let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
    "href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
    // Swift 4 and later spell this range(at:); Swift 3 used rangeAt(_:).
    // Shrink group 1 by one character on each side to drop the surrounding quotes.
    let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
    let innerTextRange = match.range(at: 2)
    let href = (html as NSString).substring(with: hrefRange)
    let innerText = (html as NSString).substring(with: innerTextRange)
    // Links that share the same text overwrite each other; the last one wins.
    resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"]
// (dictionary ordering is not guaranteed)
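To match the function shape from the question, the same approach can be wrapped up as follows. This is only a sketch, assuming Swift 4 or later (for `range(at:)` and the `Range(_:in:)` / `NSRange(_:in:)` conversions). It returns a non-optional dictionary and falls back to an empty one if the pattern fails to compile, which is a choice made here rather than anything the question requires.

```swift
import Foundation

/// Sketch of the question's extractLinks(html:), built on the pattern above.
/// Returns [link text: href]; returns an empty dictionary if the regex cannot be built.
func extractLinks(html: String) -> [String: String] {
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return [:]
    }

    var links: [String: String] = [:]
    // NSRange covering the whole string, measured in UTF-16 units as NSRegularExpression expects.
    let fullRange = NSRange(html.startIndex..., in: html)

    for match in regex.matches(in: html, options: [], range: fullRange) {
        // Convert the NSRange of each capture group back into a Swift String range.
        guard let quotedHrefRange = Range(match.range(at: 1), in: html),
              let textRange = Range(match.range(at: 2), in: html) else {
            continue
        }
        // Group 1 includes the surrounding quotes, so drop the first and last character.
        let href = String(html[quotedHrefRange].dropFirst().dropLast())
        let text = String(html[textRange])
        links[text] = href
    }
    return links
}
```

Because the dictionary is keyed by link text, two links with identical text collapse into one entry. If that matters, an array of (text, href) pairs may be a better return type.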

Keep in mind that the pattern above may mistakenly match ill-formed a-tags or miss some nested structures, and it does nothing to decode HTML character entities.
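On the character-entity point: in well-formed HTML an ampersand inside an href is normally written as `&amp;`, so the captured strings may still contain entities. Below is a minimal sketch of decoding a handful of them. The helper name `decodeBasicEntities` and the choice of entities are assumptions made for illustration. It is not a full decoder, and it ignores numeric references other than `&#39;` as well as every other named entity.

```swift
import Foundation

/// Hypothetical helper: decodes only a few common entities, nothing more.
func decodeBasicEntities(_ text: String) -> String {
    // "&amp;" is handled last so that input such as "&amp;lt;" decodes to the
    // literal text "&lt;" rather than being decoded twice into "<".
    let replacements: [(entity: String, character: String)] = [
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&quot;", "\""),
        ("&#39;", "'"),
        ("&amp;", "&")
    ]
    var result = text
    for (entity, character) in replacements {
        result = result.replacingOccurrences(of: entity, with: character)
    }
    return result
}
```

You would apply it to both the href and the inner text after extraction. A real HTML parser handles this for you, which is another point in its favor.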

If you want to make your code more robust and generic, consider adopting an HTML parser, as suggested by ColGraff and Rob.

OOPer