Scraperwiki scrape query: using lxml to extract links

Question

I suspect this is trivial, but I hope someone can help with a problem I've hit using lxml in a scraper I'm trying to build.

https://scraperwiki.com/scrapers/thisisscraper/

I'm working line by line through tutorial 3 and have got as far as trying to extract the next-page link. I can use cssselect to identify the link, but I can't work out how to isolate just the href attribute rather than the whole anchor tag.

Can anyone help?

def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    print html
    root = lxml.html.fromstring(html) #turn the HTML into lxml object
    scrape_page(root)
    next_link = root.cssselect('ol.pagination li a')[-1]

    attribute = lxml.html.tostring(next_link)
    attribute = lxml.html.fromstring(attribute)

    #works up until this point
    attribute = attribute.xpath('/@href')
    attribute = lxml.etree.tostring(attribute)
    print attribute

score 1 · Accepted Answer · answered Jul 28 '12 at 08:44

CSS selectors can select elements that have an href attribute, e.g. with a[href], but they cannot extract the attribute value by themselves.

Once you have the element from cssselect, you can use next_link.get('href') to get the value of the attribute.

Simon Sapin

score 1 · Answer 2 · answered Aug 22 '12 at 17:30

link = link.attrib['href']

should work, where link is the element returned by cssselect.
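For comparison, here is a small Python 3 sketch of how attrib['href'] differs from get('href') when the attribute is absent, plus a relative XPath as a third option. The anchor and bare elements and the href_or_none helper are hypothetical examples, not part of the original scraper. The XPath in the question probably returned nothing because an absolute path such as '/@href' starts at the document root rather than at the element itself.

```python
import lxml.html

# Hypothetical elements for illustration: one anchor with an href, one without.
anchor = lxml.html.fragment_fromstring('<a href="/page/3">Next</a>')
bare = lxml.html.fragment_fromstring('<a>No target</a>')


def href_or_none(element):
    """Read href through the attrib mapping, returning None if it is absent."""
    try:
        # attrib behaves like a dict, so a missing key raises KeyError.
        return element.attrib['href']
    except KeyError:
        return None


print(anchor.attrib['href'])   # dict-style access
print(bare.get('href'))        # .get() returns None instead of raising
print(href_or_none(bare))
# A relative XPath evaluated on the element returns a list of attribute values.
print(anchor.xpath('@href'))
```

If the link is guaranteed to have an href, the two forms are interchangeable; get() is the more forgiving choice when it might be missing.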

Mouseroot
