
Suppose I have an HTML link like this:

<a href = "https://mitsui-shopping-park.com/lalaport/koshien/" target="_blank"> https://mitsui-shopping-park.com/lalaport / koshien / </a>

I want to extract:

<a href = "THIS LINK" target="_blank"> NOT THIS LINK </a> 

I tried: someString.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil) but that gives me:

<a href = "NOT THIS LINK" target="_blank"> BUT THIS LINK </a>

Please help.

Leo Dabus
Arafin Russell
  • There is some helpful info about parsing HTML here: https://stackoverflow.com/questions/31080818/what-is-the-best-practice-to-parse-html-in-swift – user212514 May 09 '19 at 03:36
  • The currently accepted answer gives `"href = "https://mitsui-shopping-park.com/lalaport/koshien/"` instead of the actual link, is that the desired output? – ielyamani May 09 '19 at 06:45
  • In my case yes this is the desired output. – Arafin Russell May 09 '19 at 06:48

3 Answers

Here's one possible solution that matches the href attribute, from href up to and including the closing ". It returns only the first href in the string.

let html = "<a href = \"https://mitsui-shopping-park.com/lalaport/koshien/\" target=\"_blank\"> https://mitsui-shopping-park.com/lalaport / koshien / </a>"

if let hrefRange = html.range(of: "(?:href\\s*=\\s*\")[^\"]*(?:\")", options: .regularExpression) {
    let href = html[hrefRange]
    print(href)
} else {
    print("There is no href")
}

Let's break down that regular expression:

First, let's remove the extra \ escapes that are only needed to make the pattern a valid Swift string literal. This leaves us with:

(?:href\s*=\s*")[^"]*(?:")

This has three main parts:

(?:href\s*=\s*") - the href, optional space, =, optional space, and opening quote
[^"]* - the actual URL - everything that isn't a quote
(?:") - the close quote

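The (?: ) syntax creates a non-capturing group: the stuff inside is grouped without becoming a numbered capture. It is still part of the overall match, so range(of:) returns the whole href = "..." text, including href and both quotes.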
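
If you want only the URL, without the href = " prefix and the closing quote, one option is to use the regex just to locate the opening part and then read up to the next quote. This is a minimal sketch that assumes double-quoted attribute values; hrefValue(in:) is a hypothetical helper name, not something provided by Foundation:

```swift
import Foundation

// Hypothetical helper: returns the value of the first double-quoted href, or nil.
func hrefValue(in html: String) -> String? {
    // Locate `href`, optional spaces, `=`, optional spaces, and the opening quote.
    guard let prefix = html.range(of: "href\\s*=\\s*\"", options: .regularExpression) else {
        return nil
    }
    // Everything after the opening quote.
    let rest = html[prefix.upperBound...]
    // The value ends at the next quote; no closing quote means a malformed attribute.
    guard let closingQuote = rest.firstIndex(of: "\"") else {
        return nil
    }
    return String(rest[..<closingQuote])
}

let sampleHTML = "<a href = \"https://mitsui-shopping-park.com/lalaport/koshien/\" target=\"_blank\"> https://mitsui-shopping-park.com/lalaport / koshien / </a>"

if let url = hrefValue(in: sampleHTML) {
    print(url)
}
```

Like the original snippet, this only looks at the first href and does not handle single-quoted or unquoted attribute values.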

rmaddy

No need for a regular expression: you can use the link attribute of an attributed string.

First, let's use this extension:

extension String {
    func convert2Html() -> NSAttributedString {

        guard let data = data(using: .utf8) else { return NSAttributedString() }

        do {
            let htmlAttrib = NSAttributedString.DocumentType.html
            return try NSAttributedString(data: data,
                                          options: [.documentType : htmlAttrib],
                                          documentAttributes: nil)
        } catch {
            return NSAttributedString()
        }
    }
}

to convert this String:

let html = "<a href = \"https://mitsui-shopping-park.com/lalaport/koshien/\" target=\"_blank\"> https://mitsui-shopping-park.com/lalaport / koshien / </a>"

to an NSAttributedString:

let attrib = html.convert2Html()

Then extract the link this way:

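// attribute(_:at:effectiveRange:) raises an exception on an empty string, so check the length first.
if attrib.length > 0,
   let url = attrib.attribute(.link, at: 0, effectiveRange: nil) as? NSURL,
   let href = url.absoluteString {
    print(href)  // https://mitsui-shopping-park.com/lalaport/koshien/
}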
ielyamani
  • Using NSAttributedString.DocumentType.html is quite risky, as it can lead to crashes when called outside the main thread or when the document is too large. – Peter Lapisu Mar 01 '23 at 12:11

Use NSRegularExpression's matches(in:options:range:) to take advantage of regular expression capture groups. I always use this handy extension method:

extension String {
    func capturedGroups(withRegex pattern: String) -> [String?] {
        var results = [String?]()

        var regex: NSRegularExpression
        do {
            regex = try NSRegularExpression(pattern: pattern, options: [])
        } catch {
            return results
        }

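        // NSRange counts UTF-16 units, so build it from the String; self.count (Characters) is wrong for emoji and similar text.
        let matches = regex.matches(in: self, options: [], range: NSRange(startIndex..., in: self))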

        guard let match = matches.first else { return results }
        let lastRangeIndex = match.numberOfRanges - 1
        guard lastRangeIndex >= 1 else { return results }

        for i in 0...lastRangeIndex {
            let capturedGroupIndex = match.range(at: i)
            if(capturedGroupIndex.length>0)
            {
                let matchedString = (self as NSString).substring(with: capturedGroupIndex)
                results.append(matchedString)
            }
            else
            {
                results.append(nil)
            }
        }

        return results
    }
}

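let html = """
<a href = "https://mitsui-shopping-park.com/lalaport/koshien/" target="_blank"> https://mitsui-shopping-park.com/lalaport / koshien / </a>
"""
// Index 0 is the whole match; index 1 is the first capture group.
let groups = html.capturedGroups(withRegex: "href\\s*=\\s*\"([^\"]+)\"")
if groups.count > 1, let href = groups[1] {
    print(href)  // https://mitsui-shopping-park.com/lalaport/koshien/
}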
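
The extension above only looks at the first match. If the string can contain several links, the same capture-group idea extends to all of them. This is a rough sketch, again assuming double-quoted attribute values; allHrefValues(in:) is a hypothetical name chosen for illustration:

```swift
import Foundation

// Hypothetical helper: returns every double-quoted href value found in `html`.
func allHrefValues(in html: String) -> [String] {
    // Group 1 captures the text between the quotes.
    guard let regex = try? NSRegularExpression(pattern: "href\\s*=\\s*\"([^\"]*)\"") else {
        return []
    }
    // Build the NSRange from the String so the UTF-16 offsets are correct.
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, range: fullRange).compactMap { match in
        // Convert the group's NSRange back to a String range; nil if the group is absent.
        Range(match.range(at: 1), in: html).map { String(html[$0]) }
    }
}

let twoLinks = "<a href=\"https://a.example/\">A</a> 😀 <a href = \"https://b.example/x\">B</a>"
print(allHrefValues(in: twoLinks))
```

Converting ranges with NSRange(_:in:) and Range(_:in:) keeps the offsets correct even when the text contains characters outside the Basic Multilingual Plane, such as the emoji in the sample.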
Ricky Mo