XPath in Scrapy: Is there a way to combine text from the children of the same parent into a single element before it is returned?

Question

I'm using Python's Scrapy to do some web scraping, and I'm trying to get the text in the last td of my last tr in the html below.

<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>

Currently, I have this written inside my Scrapy spider's parse function.

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("tr[td/b[contains(.,'Occupation')]]/td[position()>1]/text()").extract()
    print occupation

The printed result is:

[u' Former King of ', u',', u'Former President of the United States\n']

What I'd actually like to get is something along the lines of the following (the most important change is Mars being joined to Former King of):

[u'Former King of Mars', u'Former President of the United States']

I'm aware of the | union operator in XPath, and I could have extended the occupation expression to capture the "Mars" text in the a tag. However, I want to join the a tag's text with the td's text so that "Former King of Mars" comes out as a single element of the printed list. I think that with a union, Mars would appear as its own element in the list, which is not quite what I need.

So I was hoping there is some way in XPath to join the text of the td's children so that I get "Former King of Mars" as one element of the output list. There could also be multiple a tags within a td; for example, "King" could be inside an a tag as well. Another requirement is to keep "Former President of the United States" as a separate element (somehow recognizing the br tag?).

I'm not sure what the best way to handle these cases is, but I think that if there's a way to do it in XPath, it would be better than post-processing a list in Python, because XPath still has a reference to the DOM tree. What do you think? Thanks!

Please refer to this link: http://stackoverflow.com/questions/19309960/scrapy-parsing-list-items-onto-separate-lines — Anandhakumar R, Dec 01 '14 at 08:47
"Mars" and "King" are not child and parent; they are siblings, as are "Former" and `<br>`. You can get them as a list with `tr[td/b[contains(.,'Occupation')]]/td[position()>1]//text()`, but you have to split them yourself. — Binux, Dec 01 '14 at 08:55
@AvinashRaj I haven't tried that yet, but I've already written a bunch of Scrapy code, so I was hoping I could stick with this. If XPath doesn't give me what I want, I'll have to give bs4 a shot. — pyramidface, Dec 01 '14 at 09:05
@JoelM.Lamsen That returns `[u' Former King of ', u'Mars', u',', u'Former President of the United States\n']`, but now that I have this list, it's going to be tough to write the logic to recognize that Mars should be connected with Former King of (and not something else). — pyramidface, Dec 01 '14 at 09:08
Then you can use `br` as your reference. Filter the left and right parts using the following and preceding axes, then concat them. See my answer. — Joel M. Lamsen, Dec 01 '14 at 10:19

score 0 · Answer 1 · answered Dec 01 '14 at 09:01

With BeautifulSoup, I would do it like this:

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = """<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> tr = soup.find_all('tr')[-1]
>>> td = tr.find_all('td')[-1]
>>> x = re.split(r',?\n\s*', td.text)
>>> [i for i in x if i]
[' Former King of Mars', 'Former President of the United States']

score 0 · Answer 2 · answered Dec 01 '14 at 09:36

Try this:

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("normalize-space(tr[td/b[contains(.,'Occupation')]]/td[position()>1])").extract()
    print occupation

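This returns the string value of the td element, with leading and trailing whitespace stripped and internal whitespace runs (including the line breaks) collapsed to single spaces. Note that the result is a single string, so the two occupations are not split at the br.

According to the XPath spec: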

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
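
To make the string-value rule concrete, here is a minimal sketch of what it produces for the Occupation cell. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors (an assumption made so it runs without Scrapy installed), and the `" ".join(... .split())` idiom is only a rough stand-in for `normalize-space()`.

```python
import xml.etree.ElementTree as ET

# Occupation cell copied from the question; it happens to be well-formed XML.
CELL = """<td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>"""

td = ET.fromstring(CELL)

# String value: every descendant text node, concatenated in document order.
string_value = "".join(td.itertext())

# Rough equivalent of normalize-space(): trim and collapse whitespace runs.
normalized = " ".join(string_value.split())

print(normalized)
# Former King of Mars, Former President of the United States
```

The "Mars" link text is now joined to "Former King of", but both occupations end up in one string, so the br boundary still has to be handled separately.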

score 0 · Answer 3 · answered Dec 01 '14 at 10:17

You can try this xpath:

concat(//tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[following::br], //tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[preceding::br])
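
One caveat worth checking: in XPath 1.0, `concat()` converts each node-set argument to the string value of its first node only, so an expression like the one above may not pick up "Mars". A minimal sketch of doing the br-based split in Python instead is shown below. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy selectors (an assumption made so it runs without Scrapy installed), and the helper name `split_on_br` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Occupation cell copied from the question; it happens to be well-formed XML.
CELL = """<td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>"""


def split_on_br(cell):
    """Return the cell's text as a list of segments, split at each <br>."""
    segments = []
    current = []  # text pieces collected since the last <br>

    def walk(node):
        if node.text:
            current.append(node.text)
        for child in node:
            if child.tag == "br":
                # A <br> closes the current segment.
                segments.append("".join(current))
                current.clear()
            else:
                # Any other element (e.g. <a>) contributes its text inline.
                walk(child)
            if child.tail:
                current.append(child.tail)

    walk(cell)
    segments.append("".join(current))

    # Collapse whitespace and drop the trailing comma left before the <br>.
    cleaned = (" ".join(s.split()).rstrip(",") for s in segments)
    return [s for s in cleaned if s]


occupations = split_on_br(ET.fromstring(CELL))
print(occupations)
# ['Former King of Mars', 'Former President of the United States']
```

Because the walk recurses into every non-br child, additional a tags inside the same td (for example around "King") would be joined into the same segment.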

Joel M. Lamsen

7,143
1
12
14

XPATH in Scrapy Is there a way to combine text from the children of the same parent into a single element before it is returned?

3 Answers3