I'm using Python's Scrapy to do some web scraping, and I'm trying to get the text in the last td of my last tr in the html below.

<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>

Currently, I have this inside my Scrapy spider's parse function:

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("tr[td/b[contains(.,'Occupation')]]/td[position()>1]/text()").extract()
    print(occupation)

The printed result is:

[u' Former King of ', u',', u'Former President of the United States\n']

What I'd like to get is something along the lines of the following (the most important change is that Mars is joined to "Former King of"):

[u'Former King of Mars', u'Former President of the United States']

I'm aware of the | union operator in XPath, and I could have extended the occupation expression to capture the "Mars" text in the a tag. However, I want to join the a tag's text with the td's text so that "Former King of Mars" comes out as a single element of the printed list. I think that with a union, Mars would appear as its own element in the list, which is not quite what I need.

So I was hoping there is some way in XPath to join the text of a td's children, so that I get "Former King of Mars" as one element of the output list. There could also be multiple a tags within a td; for example, "King" could be inside an a tag as well. Another requirement is to keep "Former President of the United States" as a separate element (somehow recognizing the br tag?).

I'm not sure what the best way to handle these cases is, but I think that if there's a way to do it in XPath, it'll be better than working with a list in Python, because XPath still has a reference to the DOM tree. What do you think? Thanks!

pyramidface
  • Please refer to this link: http://stackoverflow.com/questions/19309960/scrapy-parsing-list-items-onto-separate-lines – Anandhakumar R Dec 01 '14 at 08:47
  • "Mars" and "King" are not child and parent; they are siblings, as are "Former" and `<br>`. You can get them as a list with `tr[td/b[contains(.,'Occupation')]]/td[position()>1]//text()`, but you have to split them yourself.
    – Binux Dec 01 '14 at 08:55
  • please try `td[position()>1]/descendant::text()` – Joel M. Lamsen Dec 01 '14 at 08:59
  • @AvinashRaj I haven't tried that yet, but I've already written a bunch of Scrapy code, so I was hoping I could stick with this. If XPath doesn't give me what I want, I'll have to give bs4 a shot. – pyramidface Dec 01 '14 at 09:05
  • @JoelM.Lamsen That returns `[u' Former King of ', u'Mars', u',', u'Former President of the United States\n']` but now that I have this list, it's going to be tough to write the logic to recognize that Mars should be connected with Former King of (and not something else) – pyramidface Dec 01 '14 at 09:08
  • Then you can use `br` as your reference. Filter the left and right parts using the following and preceding axes, then concat them. See my answer. – Joel M. Lamsen Dec 01 '14 at 10:19

3 Answers

With BeautifulSoup, I would do it like this:

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = """<table class="infobox" style="float: right; width: 225px; text-align: left; -moz-border-radius:10px; font-size: 85%" cellpadding="2">
    <tr style="vertical-align: top;">
        <td> <b>Name</b> </td>
        <td> Abraham Lincoln
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr style="vertical-align: top;">
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>"""
>>> soup = BeautifulSoup(s, "html.parser")  # name the parser explicitly
>>> tr = soup.find_all('tr')[-1]
>>> td = tr.find_all('td')[-1]
>>> x = re.split(r',?\n\s*', td.text)
>>> [i for i in x if i]
[' Former King of Mars', 'Former President of the United States']
Avinash Raj

Try this:

def parse(self, response):
    sel = Selector(response)
    data = sel.xpath("//table[@class='infobox']")
    occupation = data.xpath("normalize-space(tr[td/b[contains(.,'Occupation')]]/td[position()>1])").extract()
    print(occupation)

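This returns the string-value of the td element with leading and trailing whitespace stripped and internal whitespace runs (including line breaks) collapsed to single spaces. Note that the result is a single string, so the text before and after the br is not split into separate items.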

According to spec:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
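
To make the quoted rule concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy. It assumes the cell is well-formed enough to parse as XML (the snippet in the question is), and it uses `" ".join(s.split())` as a rough stand-in for `normalize-space()`; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Occupation cell from the question.
TD = """<td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
    <br />Former President of the United States
</td>"""

td = ET.fromstring(TD)

# String-value: every descendant text node, concatenated in document order.
string_value = "".join(td.itertext())

# Rough stand-in for normalize-space(): strip both ends and collapse
# each run of whitespace (including newlines) to a single space.
normalized = " ".join(string_value.split())

print(normalized)
# Former King of Mars, Former President of the United States
```

The text inside the a tag is included automatically, but everything ends up in one string, so the br no longer marks a boundary.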

Rudolf Yurgenson

You can try this xpath:

concat(//tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[following::br], //tr[td/b[contains(.,'Occupation')]]/td[position() > 1]/descendant::text()[preceding::br])
Joel M. Lamsen
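
One caveat: in XPath 1.0, `concat()` converts each node-set argument to the string-value of its first node only, so an expression like the one above may not pick up every text node around the br. An alternative is to walk the td's children in Python and start a new item at each br. The sketch below is one possible way to do that with the standard library's ElementTree instead of Scrapy; `split_on_br` is a hypothetical helper, and it assumes the br elements are direct children of the td and that the markup parses as XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question (two rows kept).
HTML = """<table class="infobox">
    <tr>
        <td> <b>Sex</b> </td>
        <td> Male
        </td>
    </tr>
    <tr>
        <td> <b>Occupation </b>
        </td>
        <td> Former King of <a href="/wiki/Mars" title="Mars">Mars</a>,
            <br />Former President of the United States
        </td>
    </tr>
</table>"""


def split_on_br(td):
    """Hypothetical helper: join a cell's text, starting a new item at each <br>."""
    segments = [td.text or ""]
    for child in td:
        if child.tag == "br":
            segments.append("")  # a <br> closes the current item
        else:
            # Any other child (e.g. <a>) contributes its full text in place.
            segments[-1] += "".join(child.itertext())
        # The tail is the text that follows the child inside the td.
        segments[-1] += child.tail or ""
    # Collapse whitespace and drop the trailing comma left before the <br>.
    cleaned = (" ".join(s.split()).rstrip(",") for s in segments)
    return [s for s in cleaned if s]


table = ET.fromstring(HTML)
# Pick the row whose first cell mentions "Occupation".
row = next(
    tr for tr in table.iter("tr")
    if "Occupation" in "".join(tr.find("td").itertext())
)
result = split_on_br(row.findall("td")[-1])
print(result)
# ['Former King of Mars', 'Former President of the United States']
```

Because each non-br child is folded into the current item, several a tags in the same cell (for example around "King" as well) would still end up in one string. lxml elements expose the same `.text` and `.tail` attributes, so the idea should carry over to an lxml tree, though that is not shown here.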