How to set up an XPath query for HTML parsing?

Question

Here is some HTML from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as shown in Google Chrome, that I want to parse for a project.

<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds"><button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button>
    <h3 id="yui_3_18_1_3_1434394159641_407">Name of Substance</h3>
    <ul>
        <li id="ds2">
            <div>Acetaldehyde</div>
        </li>
    </ul>
</div>

I wrote a Python script to grab the name under one of the sections, but it isn't returning the name. I think the problem is my XPath query. Any suggestions?

from lxml import html
import requests
import csv

names1 = []

page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
tree = html.fromstring(page.text)

# This should grab the name data
names = tree.xpath('//*[@id="yui_3_18_1_3_1434380225687_700"]')

# Print the name data
print('Names:', names)

# Collect the matches as one CSV row
names1.append(names)

# Print the number of rows
print(len(names1))

# Write it to CSV
with open('testchem.csv', 'w', newline='') as b:
    a = csv.writer(b)
    a.writerows(names1)
print("The end")

I know nothing about Python, but maybe you need to add `/text()`: `//*[@id="yui_3_18_1_3_1434394159641_407"]/text()` — splash58, Jun 15 '15 at 19:06

unutbu · Accepted Answer · 2015-06-15T19:32:39.910

It is important to inspect the string returned by page.text rather than rely on the page source shown by Chrome. Web sites can return different content depending on the User-Agent. Moreover, GUI browsers such as Chrome may change the content by executing JavaScript, whereas requests.get does not.

If you write the contents to a file

import requests
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
with open('/tmp/test', 'wb') as f:
    # The file is opened in binary mode, so write the raw response bytes
    f.write(page.content)

and use a text editor to search for "yui_3_18_1_3_1434380225687_700" you'll find that there is no tag with that attribute value.
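
The same check can also be done in Python instead of a text editor. This is only a sketch: the helper name `missing_markers` is made up for illustration, and `served_html` is a shortened stand-in for the response body. With a live response you would pass `page.text` instead.

```python
def missing_markers(html_text, markers):
    """Return the markers that do not occur anywhere in html_text."""
    return [m for m in markers if m not in html_text]


# Stand-in for page.text: a shortened copy of what the server sends.
served_html = (
    '<div id="names"><h2>Names and Synonyms</h2><div class="ds">'
    '<h3>Name of Substance</h3><ul>\n'
    '<li id="ds2"><div>Acetaldehyde</div></li></ul></div></div>'
)

markers = ["yui_3_18_1_3_1434380225687_700", "Name of Substance"]
absent = missing_markers(served_html, markers)
print(absent)
```

Any marker that ends up in `absent` cannot be used as an anchor in an XPath query against this response.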

If instead you search for Name of Substance you'll find

<div><br>Search for this InChIKey on the <a href="http://www.google.com/search?q=%22IKHGUXGNUITLKF-UHFFFAOYSA-N%22" target="new" rel="nofollow">Web</a></div></div><div id="names"><h2>Names and Synonyms</h2><div class="ds"><button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button><h3>Name of Substance</h3><ul>
<li id="ds2"><div>Acetaldehyde</div></li>

Therefore, instead you could use:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
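
Note that indexing with `[0]` raises `IndexError` when the XPath matches nothing, for example if the site changes its markup. A minimal sketch of a guarded version follows. The helper name `substance_name` and the inline `SAMPLE` document (a shortened copy of the served HTML) are assumptions for illustration, not part of the original script.

```python
from lxml import html

# Shortened stand-in for page.text.
SAMPLE = """<html><body><div id="names"><h2>Names and Synonyms</h2>
<div class="ds"><h3>Name of Substance</h3><ul>
<li id="ds2"><div>Acetaldehyde</div></li>
</ul></div></div></body></html>"""


def substance_name(html_text):
    """Return the first name under 'Name of Substance', or None if absent."""
    tree = html.fromstring(html_text)
    matches = tree.xpath('//*[text()="Name of Substance"]/..//div')
    if not matches:
        return None
    return matches[0].text_content().strip()


print(substance_name(SAMPLE))
print(substance_name("<html><body><p>nothing here</p></body></html>"))
```

Returning `None` lets the caller decide whether a missing name should be skipped or reported.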

How this XPath was found:

Starting from the <h3> tag:

In [215]: tree.xpath('//*[text()="Name of Substance"]')
Out[215]: [<Element h3 at 0x7f5a290e85d0>]

The <div> tag that we want is not a child of <h3>; it is a descendant of the parent of <h3>. Therefore, go up to the parent:

In [216]: tree.xpath('//*[text()="Name of Substance"]/..')
Out[216]: [<Element div at 0x7f5a290f02b8>]

and then use //div to search for all <div>s inside the parent:

In [217]: tree.xpath('//*[text()="Name of Substance"]/..//div')
Out[217]: 
[<Element div at 0x7f5a290e88e8>,
 <Element div at 0x7f5a290e8940>,
 ...]

The first div is the one that we want:

In [218]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0]
Out[218]: <Element div at 0x7f5a290e88e8>

and we can extract the text using the text_content method:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
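
The difference between a child step (`/div`) and a descendant step (`//div`) is what makes the hop to the parent necessary. The sketch below compares the three variants on a shortened copy of the served HTML. `SAMPLE` and the variable names are illustrative assumptions, and the real page contains more `<div>`s than this sample.

```python
from lxml import html

# Shortened stand-in for page.text.
SAMPLE = """<html><body><div id="names"><h2>Names and Synonyms</h2>
<div class="ds"><h3>Name of Substance</h3><ul>
<li id="ds2"><div>Acetaldehyde</div></li>
</ul></div></div></body></html>"""

tree = html.fromstring(SAMPLE)
heading = '//*[text()="Name of Substance"]'

# The name <div> is not inside the <h3>, so searching below <h3> finds nothing.
below_heading = tree.xpath(heading + '//div')

# It is not a direct child of the parent either: <ul> and <li> sit in between.
direct_children = tree.xpath(heading + '/../div')

# A descendant search starting from the parent reaches it.
descendants = tree.xpath(heading + '/..//div')

print(len(below_heading), len(direct_children), len(descendants))
print(descendants[0].text_content())
```

Only the last query returns a match on this sample, which is why the answer combines `/..` with `//div`.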

Where are you getting the `page` library? Or is it the Python GUI? — TimTom, Jun 15 '15 at 19:30
When I try to run it, I get an error from the `with open` block: `IOError: [Errno 2] No such file or directory: '/tmp/test'`. Shouldn't it create the folder? — TimTom, Jun 15 '15 at 19:35
Did you open the file in write mode: i.e., `open('/tmp/test', 'wb')` or just `open('/tmp/test')`? — unutbu, Jun 15 '15 at 19:38
This is what I have `page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')` `with open('/tmp/test', 'wb') as f:` `f.write(page.text)` `tree = html.fromstring(page.text)` — TimTom, Jun 15 '15 at 19:40
Save that code to a file (such as `script.py`) and run `python script.py`. It should work, assuming your indentation is correct. — unutbu, Jun 15 '15 at 19:45
I did exactly that, but I still get the error. I'm using Windows PowerShell; does that affect it? — TimTom, Jun 15 '15 at 19:48
Change `/tmp/test` to some path where you would like to write temporary files. (`/tmp` is the usual place on Unix, not Windows.) — unutbu, Jun 15 '15 at 19:50
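
One way to avoid hard-coding `/tmp` is to ask Python for the platform's temporary directory. This is a sketch using the standard library's `tempfile.gettempdir()`. The file name `chemidplus_page.html` and the stand-in bytes are assumptions; with a live response you would write `page.content`.

```python
import os
import tempfile

# gettempdir() returns a usable temporary directory on Unix and Windows alike.
out_path = os.path.join(tempfile.gettempdir(), "chemidplus_page.html")

# Stand-in for page.content (the raw bytes of the HTTP response).
body = b"<html><body><h3>Name of Substance</h3></body></html>"

with open(out_path, "wb") as f:
    f.write(body)

print(out_path)
```

The printed path is the file to open in a text editor.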
