Just as the title says, I've been working on crawling the article; all that's left is the author.

Below is my code, which uses pyquery to collect the paragraphs and the author. Only the author comes back blank.

Target site: http://business.transworld.net/153984/news/surfrider-foundation-names-chad-nelsen-new-ceo/

from pyquery import PyQuery as pq

def extract_text_pyquery(html):
    p = pq(html)
    # Article body: every <p> inside the .entry container
    article_whole = p.find(".entry")
    p_tag = article_whole('p')
    print len(p_tag)
    print p_tag
    for i in range(len(p_tag)):
        text = p_tag.eq(i).text()
        print text
    # Author link: this is the part that comes back blank
    entire = p.find("#main")
    author = entire.find('a').filter('.author')
    print 'By:', author
fsbinesh

1 Answer

The class isn't author, the rel is; a period selects a class. You should instead filter for '[rel="author"]'. Brackets let you filter on attributes (such as rel) rather than classes.

ragingSloth
  • Thank you! Almost had it, I guess I should've been more specific in that I want to obtain the name without the tags/functions attached. Currently, it shows the line copied from the page source, then the name alone. I've entered it as you suggested, then added the "for i in range" and that was the result. – fsbinesh Oct 01 '14 at 06:01
  • That's going to be specific to pyquery, but there should be a way to access an individual tag's value – ragingSloth Oct 01 '14 at 15:10