I'm using Python's Scrapy to do some web scraping, and I'm trying to get the text in the last td of the last tr in the HTML below:
<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
<tr style="vertical-align: top;">
<td> <b>Name</b> </td>
<td> Abraham Lincoln
</td>
</tr>
<tr style="vertical-align: top;">
<td> <b>Sex</b> </td>
<td> Male
</td>
</tr>
<tr style="vertical-align: top;">
<td> <b>Occupation </b>
</td>
<td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
<br />Former President of the United States
</td>
</tr>
</table>
Currently, I have this inside my spider's parse function:
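from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    # Select the infobox table, then the cells that follow the "Occupation" label cell.
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("tr[td/b[contains(.,'Occupation')]]/td[position()>1]/text()").extract()
    print(occupation)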
The printed result is:
[u' Former King of ', u',', u'Former President of the United States\n']
What I'd actually like to get is something along these lines (the most important change is that "Mars" is joined onto "Former King of"):
[u'Former King of Mars', u'Former President of the United States']
I'm aware of the | union operator in XPath, and I could have extended the occupation expression to capture the "Mars" text in the a tag. However, I want to join the a tag's text with the td text so that "Former King of Mars" comes out as a single element of the printed list. I think with a union, "Mars" would appear as its own element in the list, which is not quite what I need.
So I was hoping there is some way in XPath to join the text of the parent td's children, so that "Former King of Mars" is one element of the output list. There could also be multiple a tags within a td; for example, "King" could be inside an a tag as well. Another requirement is to keep "Former President of the United States" as a separate element (by somehow recognizing the br tag?).
I'm not sure what the best way to handle these cases is, but I think that if there's a way to do it in XPath, it would be better than working with a list in Python, because XPath still has a reference to the DOM tree. What do you guys think? Thanks!
You can get them as a list `tr[td/b[contains(.,'Occupation')]]/td[position()>1]//text()`, but you have to split them yourself. – Binux Dec 01 '14 at 08:55
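One way to do the splitting yourself, as a minimal sketch rather than a definitive answer: extract the td's markup as a string (for instance by dropping the trailing `/text()` from the expression in the question and calling `.extract()`), then walk the fragment and start a new segment at every br. The names `BrSplitter` and `split_on_br` are made up for this illustration, the trailing-comma stripping is an assumption based on the desired output above, and the code is written for Python 3 using only the standard library.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect text from an HTML fragment, starting a new segment at each <br>."""

    def __init__(self):
        super().__init__()
        self.segments = [[]]  # one list of text pieces per <br>-separated segment

    def handle_starttag(self, tag, attrs):
        # Called for both <br> and <br /> (the latter via handle_startendtag).
        if tag == "br":
            self.segments.append([])

    def handle_data(self, data):
        # Text inside nested tags such as <a> lands in the current segment.
        self.segments[-1].append(data)


def split_on_br(fragment):
    """Return one cleaned string per <br>-separated segment of the fragment."""
    parser = BrSplitter()
    parser.feed(fragment)
    parser.close()
    results = []
    for parts in parser.segments:
        # Join the pieces, collapse whitespace, and drop trailing commas
        # (assumption: commas before a <br> are not wanted in the output).
        text = " ".join("".join(parts).split()).rstrip(", ")
        if text:
            results.append(text)
    return results


# The Occupation cell from the question, as .extract() on the td would return it.
td_html = (
    '<td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,\n'
    "<br />Former President of the United States\n</td>"
)
print(split_on_br(td_html))
# ['Former King of Mars', 'Former President of the United States']
```

Because the text of every nested tag is appended to the current segment, several a tags inside the same td (for example one around "King" as well) are joined in the same way.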