import lxml.html
from lxml.cssselect import CSSSelector

def extract_page_data(html):
    tree = lxml.html.fromstring(html)
    item_sel = CSSSelector('.my-item')
    text_sel = CSSSelector('.my-text-content')
    time_sel = CSSSelector('.time')
    author_sel = CSSSelector('.author-text')
    a_tag = CSSSelector('.a')

    for item in item_sel(tree):
        yield {'href': a_tag(item)[0].text_content(),
               'my pagetext': text_sel(item)[0].text_content(),
               'time': time_sel(item)[0].text_content().strip(),
               'author': author_sel(item)[0].text_content()}

I want to extract the href of each item, but I am not able to get it with this code.

innicoder

1 Answer


text_content() returns an element's text, not its attributes. Try reading the href attribute instead:

'href': a_tag(item)[0].attrib['href']

or

'href': a_tag(item)[0].get('href')

Alternatively, you can use XPath:

tree.xpath(".//a/@href")
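
A short sketch of the XPath route, again with made-up markup and names (sample_html, all_hrefs, per_item). Called on tree, the expression returns every href in the document as a list. To get one value per item, as the loop in the question does, the same expression can be run on each item instead.

```python
import lxml.html

# Hypothetical markup, only for illustration; the second item has no link.
sample_html = """
<div>
  <div class="my-item">
    <a href="/posts/1">First post</a>
  </div>
  <div class="my-item">
    <span class="my-text-content">No link here</span>
  </div>
</div>
"""

tree = lxml.html.fromstring(sample_html)

# All href values anywhere under the root, as a list of strings.
all_hrefs = tree.xpath(".//a/@href")

# One value per item: the leading '.' makes the path relative to each item.
per_item = []
for item in tree.xpath(".//div[@class='my-item']"):
    hrefs = item.xpath(".//a/@href")
    per_item.append(hrefs[0] if hrefs else None)

print(all_hrefs)
print(per_item)
```

Because xpath() always returns a list here, checking for an empty result avoids the IndexError that [0] would raise on an item without a link.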
Andersson