What is the best practice to parse html in swift?

Question

I'm a Swift newbie. I need something like Python's BeautifulSoup for a Swift iOS project. Specifically, I need to get the href of every <a> whose href ends with ".txt". What steps should I take?

score 113 · Accepted Answer · edited Dec 24 '22 at 03:05

There are several good libraries for parsing HTML in Swift and Objective-C, including hpple, NDHpple, Kanna, Fuzi, SwiftSoup, and Ji.

The examples below show each library in turn; most of them select the links with an XPath query:

hpple:

let data = NSData(contentsOfFile: path)
let doc = TFHpple(htmlData: data)

if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") as? [TFHppleElement] {
   for element in elements {
       print(element.content)
   }
}

NDHpple:

let data = NSData(contentsOfFile: path)!
let html = NSString(data: data, encoding: NSUTF8StringEncoding)!
let doc = NDHpple(HTMLData: html)
if let elements = doc.searchWithXPathQuery("//a/@href[ends-with(.,'.txt')]") {
   for element in elements {
     print(element.children?.first?.content)
   }
}

Kanna (Xpath and CSS Selectors):

let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

if let doc = Kanna.HTML(html: html, encoding: NSUTF8StringEncoding) {
   let bodyNode = doc.body

   if let inputNodes = bodyNode?.xpath("//a/@href[ends-with(.,'.txt')]") {
      for node in inputNodes {
         print(node.contents)
      }
   }
}

Fuzi (Xpath and CSS Selectors):

let html = "<html><head></head><body><ul><li><input type='image' name='input1' value='string1value' class='abc' /></li><li><input type='image' name='input2' value='string2value' class='def' /></li></ul><span class='spantext'><b>Hello World 1</b></span><span class='spantext'><b>Hello World 2</b></span><a href='example.com'>example(English)</a><a href='example.co.jp'>example(JP)</a></body>"

do {
  // if encoding is omitted, it defaults to NSUTF8StringEncoding
  let doc = try HTMLDocument(string: html, encoding: NSUTF8StringEncoding)

  // XPath queries
  for anchor in doc.xpath("//a/@href[ends-with(.,'.txt')]") {
    print(anchor.stringValue)
  }

} catch let error {
    print(error)
}

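The ends-with function is part of XPath 2.0. libxml2, which hpple, Kanna, Fuzi, and Ji are built on, implements XPath 1.0, so if the query above is rejected, an XPath 1.0 equivalent is //a/@href[substring(., string-length(.) - 3) = '.txt'].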
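A fallback that does not depend on XPath string functions is to select every href with a plain //a/@href query and filter the results in Swift. The sketch below shows only the filtering step; `filterHrefs` and the sample values are hypothetical, and the input array stands in for whatever strings your parser returns.

```swift
import Foundation

/// Keeps only the hrefs that end with the given suffix, ignoring case.
/// `hrefs` stands in for the strings a parser returns for "//a/@href".
func filterHrefs(_ hrefs: [String], endingWith suffix: String) -> [String] {
    let wanted = suffix.lowercased()
    return hrefs.filter { $0.lowercased().hasSuffix(wanted) }
}

// Hypothetical sample values, not taken from a real page.
let allHrefs = ["http://foo.com/bar.txt", "example.com", "notes/README.TXT", "a.txt.zip"]
let txtHrefs = filterHrefs(allHrefs, endingWith: ".txt")
print(txtHrefs)
```

Filtering in Swift keeps the query trivial, at the cost of pulling every link out of the document first.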

SwiftSoup (CSS Selectors):

do {
    let doc: Document = try SwiftSoup.parse("...")
    let links: Elements = try doc.select("a[href]") // a with href
    let pngs: Elements = try doc.select("img[src$=.png]") // img with src ending .png
    let masthead: Element? = try doc.select("div.masthead").first() // div with class=masthead
    let resultLinks: Elements? = try doc.select("h3.r > a") // direct a after h3
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}
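Applied to the original question, the same suffix selector used for img[src$=.png] above should also work on href. This is a minimal sketch that assumes the SwiftSoup package is already added to the project; `txtLinks(in:)` and the sample HTML are hypothetical.

```swift
import SwiftSoup

/// Returns the href of every <a> whose href ends with ".txt".
func txtLinks(in html: String) throws -> [String] {
    let doc: Document = try SwiftSoup.parse(html)
    // [attr$=suffix] matches attribute values that end with the suffix.
    let anchors: Elements = try doc.select("a[href$=.txt]")
    return try anchors.array().map { try $0.attr("href") }
}

do {
    // Hypothetical sample markup.
    let html = "<a href='http://foo.com/bar.txt'>Da TXT</a> <a href='example.com'>example</a>"
    print(try txtLinks(in: html))
} catch {
    print(error)
}
```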

Ji (XPath):

let jiDoc = Ji(htmlURL: URL(string: "http://www.apple.com/support")!)
let titleNode = jiDoc?.xPath("//head/title")?.first
print("title: \(titleNode?.content)") // title: Optional("Official Apple Support")

I get `ambiguous use of init(HTMLData:)` all the time. Tried messing around with `as!` and `:` and everything but I can't get it working. Any ideas? I hate swift — user2161301, Aug 15 '17 at 10:04
Ok now this took me 2 hours, it's htmlData and not HTMLData. Thanks for your answer tho, I will edit your reply to save others from that — user2161301, Aug 15 '17 at 10:26
Might also be the case with the other libraries but cbf to check — user2161301, Aug 15 '17 at 10:29
I notice that libxml2 does not appear in the list; it is not even mentioned. Does that mean it is totally old-fashioned? I came to this post while searching on the subject of HTML parsing; I have a problem (https://stackoverflow.com/questions/52695180/htmldocptr-gethtml-issue) that I need to solve. — Michel, Oct 08 '18 at 06:22
@Michel You can still use `libxml2` in your project directly and create a wrapper to the library, but for example, the latest added (**Ji**) is exactly that. — Victor Sigler, Oct 09 '18 at 15:28
Thanks. In the meanwhile I started to look at SwiftSoup, it seems to be OK for the time being. At least I can do what I need. Do you have any opinion on it? — Michel, Oct 10 '18 at 14:02
NDHpple doesn't compile and is fairly outdated. I couldn't get it to work in Swift 5. — JCutting8, Apr 19 '22 at 21:27

score 9 · Answer 2 · edited Nov 20 '16 at 22:50

Try SwiftSoup, a port of jsoup to Swift.

let html: String = "<a id=1 href='?foo=bar&mid&lt=true'>One</a> <a id=2 href='?foo=bar&lt;qux&lg=1'>Two</a>"
let els: Elements = try SwiftSoup.parse(html).select("a")
for element: Element in els.array() {
    print(try element.attr("href"))
}

answered Nov 20 '16 at 21:17 by Scinfu · edited Nov 20 '16 at 22:50 by Ed Rands

Good work. The documentation could be a bit clearer; the current information is not enough to get started. Showing how to perform actions via the document object and how to use forms would be a great start. – Muhammad Adnan Nov 29 '16 at 09:58
@m Other documentation is in the Wiki section, but I'm still writing it. – Scinfu Nov 29 '16 at 11:22
@Scinfu: Does it support Swift 2? – user484691 Apr 20 '17 at 05:43
It's written in Swift 3. – Scinfu Apr 20 '17 at 07:55

Kio Coan · Answer 3 · 2015-06-26T19:54:36.050

You could try this swift-html-parser:

https://github.com/tid-kijyun/Swift-HTML-Parser

It helps a lot.

And to load your HTML from a .txt file, you can:

import Foundation

let file = "file.txt"

// Locate the Documents directory, then read the file as UTF-8 text.
if let dir = FileManager.default.urls(for: .documentDirectory, in: .allDomainsMask).first {
    let url = dir.appendingPathComponent(file)
    let html = try? String(contentsOf: url, encoding: .utf8) // nil if the file cannot be read
}

Edit:

To get what you need, you could adapt the library's example:

import Foundation

let html = "theHtmlYouWannaParse"

var err : NSError?
let parser = HTMLParser(html: html, error: &err)
if err != nil {
    print(err)
    exit(1)
}

let bodyNode = parser.body

// Print the contents of every <b> tag
if let inputNodes = bodyNode?.findChildTags("b") {
    for node in inputNodes {
        print(node.contents)
    }
}

// Print the href of every <a> tag
if let inputNodes = bodyNode?.findChildTags("a") {
    for node in inputNodes {
        print(node.getAttributeNamed("href")) //<- Here you would get your files link
    }
}
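The last loop prints every href. To keep only links to .txt files, even when a query string follows the file name, one option is to compare the URL's path extension instead of the raw string suffix. This is a sketch using only Foundation; `isTextFileLink` and the sample hrefs are hypothetical.

```swift
import Foundation

/// True when the href's path (ignoring any query or fragment) has a "txt" extension.
func isTextFileLink(_ href: String) -> Bool {
    guard let url = URL(string: href) else { return false }
    return url.pathExtension.lowercased() == "txt"
}

// Hypothetical sample values standing in for the hrefs collected by the loop.
let sampleHrefs = [
    "http://foo.com/bar.txt",
    "http://foo.com/bar.txt?download=1",
    "http://foo.com/index.html"
]
let txtOnly = sampleHrefs.filter(isTextFileLink)
print(txtOnly)
```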

Thank you. I don't need to extract HTML from a .txt file. I need to extract the .txt hrefs from HTML with your parser: `Da TXT --> http://foo.com/bar.txt` — amazingbasil, Jun 26 '15 at 19:44
