
Consider the following HTML:

<div id="relevantID">

<div class="column left">
     <h1> Section-Header-1 </h1>
     <ul>
         <li>item1a</li>
         <li>item1b</li>
         <li>item1c</li>
         <li>item1d</li>
     </ul>
</div>

<div class="column">
     <ul> <!-- Pay attention here -->
         <li>item1e</li>
         <li>item1f</li>
     </ul>
     <h1> Section-Header-2 </h1>
     <ul>
         <li>item2a</li>
         <li>item2b</li>
         <li>item2c</li>
         <li>item2d</li>
     </ul>
</div>

<div class="column right">
     <h1> Section-Header-3 </h1>
     <ul>
         <li>item3a</li>
         <li>item3b</li>
         <li>item3c</li>
         <li>item3d</li>
     </ul>
</div>

</div>

My objective is to extract the items under each section header. Inconveniently, the designer of the webpage split the data into three columns, wrapping each one in an additional div (with classes such as "column right").

My current method of extraction uses two XPath expressions.

For the section headers, I select all h1 elements within the div with the given id:

//div[@id="relevantID"]//h1 

This returns a list of h1 elements. Looping over them, I apply a second selector to each matched h1 to find the ul nodes that follow it as siblings and retrieve all their li nodes:

following-sibling::ul//li

But thanks to the designer's aesthetics, this fails in the one case I've marked in the HTML, where the items of a section are split across two different column divs.

I could probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the HTML just to make a selector match is good practice (I haven't seen it needed in any of the examples I've browsed so far).

What would be a good way to extract data that has been formatted like this? Full solutions are not necessary; hints and tips will do. Thanks!

pad

2 Answers


The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes, provided the columns at least keep the list items in proper document order. (That is indeed the case in your example.)

The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 in document order. The [1] predicates pick the nearest heading in each direction, so the same pattern also works for other pairs of adjacent headings:

//div[@id='relevantID']//li[normalize-space(preceding::h1[1]) = 'Section-Header-1'
                            and normalize-space(following::h1[1]) = 'Section-Header-2']

Specifically, it selects the following items from your example HTML:

<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
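
For readers who want to try the document-order idea without an XPath engine, here is a minimal Python sketch. It is not the XPath above: the standard library's xml.etree.ElementTree does not support the following:: and preceding:: axes, so the sketch walks the tree in document order and assigns each li to the most recently seen h1. The HTML constant is a trimmed copy of the question's markup, the function name items_by_section is made up for illustration, and the sketch assumes the markup is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (assumption: it parses as XML).
HTML = """
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1e</li><li>item1f</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li><li>item2b</li></ul>
  </div>
  <div class="column right">
    <h1> Section-Header-3 </h1>
    <ul><li>item3a</li></ul>
  </div>
</div>
"""

def items_by_section(markup):
    """Group li texts under the nearest preceding h1 in document order."""
    root = ET.fromstring(markup)
    sections = {}
    current = None
    # Element.iter() visits elements depth-first, i.e. in document order.
    for node in root.iter():
        if node.tag == "h1":
            # Collapse whitespace, similar to XPath's normalize-space().
            current = " ".join(node.text.split())
            sections[current] = []
        elif node.tag == "li" and current is not None:
            sections[current].append(node.text)
    return sections

sections = items_by_section(HTML)
for title, items in sections.items():
    print(title, items)
```

Because the grouping depends only on document order, the column divs have no effect on the result; it would break in the same way as the XPath if a column placed items out of order.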
kjhughes

You can combine following-sibling and preceding-sibling with the union operator | to also pick up li elements that appear in the same div before the h1. For example, for the second h1:

((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) | 
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)

Result:

<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>

Since you're already selecting all h1 elements using //div[@id="relevantID"]//h1 and then retrieving the li items for each h1 with following-sibling::ul//li as a second step, you could simply change that second step to following-sibling::ul//li | preceding-sibling::ul//li.
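
As a rough illustration of what that union selects, the sketch below emulates it in Python with the standard library only. It is an approximation rather than the XPath itself: xml.etree.ElementTree has no sibling axes, so the code collects every ul that shares a parent with each h1, which is the same set as the preceding and following ul siblings combined. The HTML constant is a trimmed copy of the question's markup and the function name sibling_items is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (assumption: it parses as XML).
HTML = """
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1e</li><li>item1f</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li><li>item2b</li></ul>
  </div>
</div>
"""

def sibling_items(markup):
    """For each h1, collect li texts from all ul siblings (before and after)."""
    root = ET.fromstring(markup)
    result = {}
    for parent in root.iter():
        # findall("h1") matches only direct children of this parent.
        for h1 in parent.findall("h1"):
            title = " ".join(h1.text.split())
            # Every ul child of the same parent is a sibling of the h1.
            result[title] = [
                li.text
                for ul in parent.findall("ul")
                for li in ul.iter("li")
            ]
    return result

by_heading = sibling_items(HTML)
for title, items in by_heading.items():
    print(title, items)
```

Note that this attaches item1e and item1f to the second heading, exactly as the union expression does, because they share a column div with it.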

matthias_h