Can't get subscript text, from parsing html

Question

I am parsing a web page for an inorganic compound and need to get its chemical formula.

// Load the page as Data directly instead of force-casting an optional NSData.
let data = try Data(contentsOf: URL(string: "https://en.wikipedia.org/wiki/Gold(III)_bromide")!)
let doc = TFHpple(htmlData: data)

if let elements = doc?.search(withXPathQuery: "//*[@class='selflink']/text()") as? [TFHppleElement] {
    for element in elements {
        print("------")
        print(element.content)
    }
}

It prints "AuBr", but I need it to print the whole formula, which is "AuBr₃".

This is the html code I'm getting the formula from:

How can I make it print out the whole formula with the 3 at the end?

score 1 · Accepted Answer · answered Nov 24 '16 at 15:13

Given the following HTML from the Wiki page:

<tr>
  <td>
    <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Chemical_formula" title="Chemical formula">Chemical formula</a></div>
  </td>
  <td>AuBr<sub>3</sub></td>
</tr>

the following XPath expression

string(//tr[td[1]/div/a = "Chemical formula"]/td[2])

will return:

> xmllint --xpath 'string(//tr[td[1]/div/a = "Chemical formula"]/td[2])' ~/test.html
AuBr3
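
The same idea can be sketched in Swift without the `string()` wrapper: select the `<td>` element itself rather than its `text()` children, then read the element's string value, which includes the text inside `<sub>`. This is only a sketch under stated assumptions: it swaps in Foundation's `XMLDocument` instead of TFHpple, the helper name `chemicalFormula(in:)` is hypothetical, and the input is the well-formed fragment quoted above wrapped in a `<table>` root. Real pages are rarely well-formed XML, so a lenient HTML parser would be needed in practice.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLDocument lives in this module on Linux
#endif

// The table row quoted above, wrapped in a root element so it parses as XML.
let html = """
<table>
<tr>
  <td>
    <div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Chemical_formula" title="Chemical formula">Chemical formula</a></div>
  </td>
  <td>AuBr<sub>3</sub></td>
</tr>
</table>
"""

/// Hypothetical helper: returns the string value of the cell that follows
/// the "Chemical formula" label, or nil if no such row exists.
func chemicalFormula(in markup: String) throws -> String? {
    let document = try XMLDocument(xmlString: markup, options: [])
    // Select the <td> element itself, not its text() children.
    let cells = try document.nodes(forXPath: "//tr[td[1]/div/a = 'Chemical formula']/td[2]")
    // An element's stringValue concatenates all descendant text, including the <sub>.
    return cells.first?.stringValue
}

let formula = try chemicalFormula(in: html)
print(formula ?? "not found")
```

The point is the difference between `td[2]/text()`, which yields only the direct text node `AuBr`, and the string value of `td[2]`, which also picks up the digit nested in `<sub>`.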

Markus

I have one more question: what if the HTML was like this: `Barium chloride – BaCl₂`? – Benja0906 Nov 25 '16 at 22:25

@Benja0906 You can use `concat(//li/text()[2],//li/sub)` to get ` – BaCl2`. I'll assume you can figure out how to strip the prefix you don't want. But this is so dependent on the exact structure of the HTML, I wouldn't recommend using it. – Markus Nov 26 '16 at 21:53
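
For the prefix-stripping step mentioned in the comment above, a minimal Swift sketch might look like the following. The helper name `stripPrefix` and the assumption that the separator is always exactly `" – "` are illustrative only; as the comment notes, anything this tied to the page's exact structure is fragile.

```swift
import Foundation

/// Hypothetical helper: drops everything up to and including the first
/// " – " separator and trims surrounding whitespace from what remains.
/// If the separator is absent, the trimmed input is returned unchanged.
func stripPrefix(_ raw: String, separator: String = " – ") -> String {
    guard let range = raw.range(of: separator) else {
        return raw.trimmingCharacters(in: .whitespaces)
    }
    return String(raw[range.upperBound...]).trimmingCharacters(in: .whitespaces)
}

print(stripPrefix(" – BaCl2"))
```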

I can't seem to get it to work. That HTML is from the source of this page: https://en.wikipedia.org/wiki/List_of_inorganic_compounds, from line 83 onward. I need to get the formula for every compound, and it is different on every line. – Benja0906 Nov 26 '16 at 22:55

@Benja0906 Well, the answer you received on your separate question should work. – Markus Nov 26 '16 at 23:14

score 0 · Answer 2 · answered Nov 29 '16 at 22:15

Try SwiftSoup

Parse your HTML:

import SwiftSoup

let document = try SwiftSoup.parse("<li><strong class='selflink'>AuBr<sub>3</sub></strong></li>")
let selflinkElements = try document.getElementsByClass("selflink")

// text() and html() can throw, so they need `try`.
print(selflinkElements.get(0).tagName())   // prints "strong"
print(try selflinkElements.get(0).text())  // prints "AuBr3"
print(try selflinkElements.get(0).html())  // prints "AuBr<sub>3</sub>"
