
I am parsing the HTML from http://chem.sis.nlm.nih.gov/chemidplus/name/acetone and want to extract some data from it, such as the Acetone entry under MeSH Heading. This follows on from my similar post, How to set up XPath query for HTML parsing?

<div id="names">
 <h2>Names and Synonyms</h2>
  <div class="ds">
   <button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button>
 <h3>Name of Substance</h3>
 <div class="yui3-g-r">
  <div class="yui3-u-1-4">
   <ul>
    <li id="ds2">
     <div>2-Propanone</div>
    </li>
   </ul>
  </div>
  <div class="yui3-u-1-4">
   <ul>
    <li id="ds3">
     <div>Acetone</div>
    </li>
   </ul>
  </div>
  <div class="yui3-u-1-4">
   <ul>
    <li id="ds4">
     <div>Acetone [NF]</div>
    </li>
   </ul>
  </div>
  <div class="yui3-u-1-4">
   <ul>
    <li id="ds5">
     <div>Dimethyl ketone</div>
    </li>
   </ul>
  </div>
 </div>
 <h3>MeSH Heading</h3>
  <ul>
   <li id="ds6">
    <div>Acetone</div>
   </li>
  </ul>
 </div>
</div>

On other pages I used mesh_name = tree.xpath('//*[text()="MeSH Heading"]/..//div')[1].text_content() to extract the data, because those pages had similar structures. That no longer works here, since I didn't account for inconsistencies between pages. Is there a way to go to the heading node I want and then get the content that belongs to it, so that the same query works consistently across different pages?

Would doing tree.xpath('//*[text()="MeSH Heading"]//preceding-sibling::text()[1]') work?

TimTom

1 Answer


From what I understand, you need to get the list of items by a heading title.

How about making a reusable function that would work for every heading in the "Names and Synonyms" container:

from lxml.html import parse


tree = parse("http://chem.sis.nlm.nih.gov/chemidplus/name/acetone")

def get_contents_by_title(tree, title):
    # Text of every div inside the first sibling element after the matching h3
    return tree.xpath("//h3[. = '%s']/following-sibling::*[1]//div/text()" % title)

print(get_contents_by_title(tree, "Name of Substance"))
print(get_contents_by_title(tree, "MeSH Heading"))

Prints:

['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
['Acetone']
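
If you want to try the function without hitting the site, here is a rough offline sketch. It assumes a trimmed, hand-written copy of the markup from the question (SNIPPET is my own shortened version, not the live page), and it passes the heading as an lxml XPath variable ($title) instead of formatting it into the string, which should also cope with headings that contain quotes. Treat it as an illustration rather than a drop-in replacement.

```python
from lxml import html

# Trimmed copy of the markup from the question (assumption: the live page keeps this shape).
SNIPPET = """
<div id="names">
 <div class="ds">
  <h3>Name of Substance</h3>
  <div class="yui3-g-r">
   <div class="yui3-u-1-4"><ul><li id="ds2"><div>2-Propanone</div></li></ul></div>
   <div class="yui3-u-1-4"><ul><li id="ds3"><div>Acetone</div></li></ul></div>
  </div>
  <h3>MeSH Heading</h3>
  <ul><li id="ds6"><div>Acetone</div></li></ul>
 </div>
</div>
"""


def get_contents_by_title(tree, title):
    # $title is an XPath variable; lxml binds it from the keyword argument.
    return tree.xpath(
        "//h3[. = $title]/following-sibling::*[1]//div/text()", title=title
    )


tree = html.fromstring(SNIPPET)
names = get_contents_by_title(tree, "Name of Substance")
mesh = get_contents_by_title(tree, "MeSH Heading")
missing = get_contents_by_title(tree, "No Such Heading")  # heading not on the page

print(names)
print(mesh)
print(missing)
```

A heading that is not on the page simply gives back an empty list, so the caller can check for that instead of catching an IndexError.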
alecxe
  • Ah you're right, I forgot about functions. Though could you explain the xpath syntax for the function? – TimTom Jun 30 '15 at 17:02
  • @TimTom sure, here we locate the `h3` by its text, take its first following sibling, and extract the text of all div elements anywhere inside that sibling. Hope this makes things a bit clearer. – alecxe Jun 30 '15 at 17:05
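
The three steps described in that comment can also be run one at a time, which may make the expression easier to follow. This is only a sketch on a small made-up fragment shaped like the markup in the question (the fragment and the variable names are mine, not from the page). The last query uses the preceding-sibling axis for comparison: it looks backwards from the heading, so it lands in the previous section rather than the one under the heading.

```python
from lxml import html

# Hypothetical fragment shaped like the markup in the question.
doc = html.fromstring(
    "<div><h3>Name of Substance</h3>"
    "<div><ul><li><div>2-Propanone</div></li></ul></div>"
    "<h3>MeSH Heading</h3>"
    "<ul><li><div>Acetone</div></li></ul></div>"
)

# Step 1: locate the h3 whose string value equals the heading text.
heading = doc.xpath("//h3[. = 'MeSH Heading']")[0]

# Step 2: take the first element that follows it at the same level.
sibling = heading.xpath("following-sibling::*[1]")[0]

# Step 3: collect the text of every div anywhere below that sibling.
texts = sibling.xpath(".//div/text()")

# For comparison: preceding-sibling walks backwards from the heading.
before = heading.xpath("preceding-sibling::*[1]//div/text()")

print(heading.tag, sibling.tag)
print(texts)
print(before)
```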