Python: lxml xpath to extract content

Question

The code below extracts the P/E ratio from the first Reuters link. However, my method is not robust: the page for another stock has two fewer lines, which shifts the data to a different index. How can I get around this problem? I would like to point straight at the P/E row to extract the data, but I do not know how to do it.

Link 1: http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL
Link 2: http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL

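import requests
from lxml import html

page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')
treea = html.fromstring(page2.content)
# Collect the text of every <td> that has a class attribute, then pick by position
tree4 = treea.xpath('//td[@class]/text()')
PE = tree4[37]  # fragile: this index shifts when the page has fewer rows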

This is the part I want the code to extract, and only this part, so that changes elsewhere on the page do not affect the result.

<tr class="stripe">
    <td>P/E Ratio (TTM)</td>
    <td class="data">36.79</td>
    <td class="data">25.99</td>
    <td class="data">21.70</td>
</tr>

score 1 · Accepted Answer · answered Sep 07 '16 at 14:52

Use the label text to find the first td, then extract its sibling tds:

 treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')
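To see why this does not depend on the row's position, here is a minimal offline sketch. The two HTML snippets are made-up stand-ins for the Reuters pages (only the P/E row is copied from the question; the "Beta" rows and their values are invented filler), and `pe_values` is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Hypothetical sample markup, NOT the real Reuters pages: the P/E row sits
# at a different position in each table. The "Beta" rows are invented filler.
PAGE_A = """<html><body><table>
  <tr><td>Beta</td><td class="data">1.10</td><td class="data">0.95</td><td class="data">1.00</td></tr>
  <tr class="stripe"><td>P/E Ratio (TTM)</td><td class="data">36.79</td><td class="data">25.99</td><td class="data">21.70</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
  <tr class="stripe"><td>P/E Ratio (TTM)</td><td class="data">--</td><td class="data">25.49</td><td class="data">17.30</td></tr>
  <tr><td>Beta</td><td class="data">0.80</td><td class="data">0.95</td><td class="data">1.00</td></tr>
</table></body></html>"""

# Find the <td> whose text contains the label, then take the text of every
# <td> that follows it inside the same row.
PE_XPATH = '//td[contains(.,"P/E Ratio")]/following-sibling::td/text()'


def pe_values(markup):
    """Return the data cells of the P/E row as a list of strings."""
    tree = html.fromstring(markup)
    return tree.xpath(PE_XPATH)


print(pe_values(PAGE_A))
print(pe_values(PAGE_B))
```

Because the match is anchored on the label rather than on an index such as `tree4[37]`, adding or removing other rows does not change the result. If the label is missing altogether, `xpath` returns an empty list, so it is worth checking for that before indexing.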

The expression finds the right row on both pages, regardless of how many rows come before it:

In [8]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=MYEG.KL')

In [9]: treea = html.fromstring(page2.content)    
In [10]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')

In [11]: print(tree4)
['36.79', '25.99', '21.41']

In [12]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialHighlights?symbol=ANNJ.KL')
In [13]: treea = html.fromstring(page2.content)

In [14]: tree4 = treea.xpath('//td[contains(.,"P/E Ratio")]/following-sibling::td/text()')

In [15]: print(tree4)
['--', '25.49', '17.30']
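Note that the values come back as strings, and the second page returns '--' where a figure is missing. If you need numbers, one possible way to convert them is sketched below. `to_float` is a hypothetical helper name, and stripping thousands separators is my assumption about how larger figures might be formatted, not something taken from the page.

```python
def to_float(cell):
    """Return the cell as a float, or None for placeholders such as '--'."""
    # Assumption: larger figures may carry thousands separators like "1,234.5".
    text = cell.strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        # '--' and empty cells are not valid numbers
        return None


print([to_float(c) for c in ['--', '25.49', '17.30']])
# [None, 25.49, 17.3]
```

Returning None instead of raising keeps the list the same length as the row, so the position of each value is preserved.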
