Getting, visiting and limiting the number of links using Nokogiri and Mechanize?

Question

I am trying to scrape the five latest stories from CNN.com and retrieve their links along with the first paragraph of each story. I have this simple script:

url = "http://edition.cnn.com/?refresh=1"
agent = Mechanize.new
agent.get("http://edition.cnn.com/?refresh=1").search("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline| 
 article = headline.text
 link = URI.join(url, headline[:href]).to_s
 page = headline.click(link)
 paragraph1 = page.at_css(".adtag15090+ p").text
 puts "#{article}"
 puts "#{link}"
 puts "#{paragraph1}"
 puts "\n"
end

This code doesn't work because the click method is not recognized. It raises this error:

cnn_scraper.rb:10:in `block in <main>': undefined method `click' for #<Nokogiri::XML::Element:0x2c49b40> (NoMethodError)

The first paragraph of every article on CNN.com matches the selector .adtag15090+ p. Also note that the script iterates over every matching headline, whereas I want only five. Any ideas about how to get the first five stories and their first paragraphs using Nokogiri and Mechanize?

Why don't you call another nokogiri to parse content of headline page? Example: doc1 = Nokogiri::HTML(open(link)) — Thomas Tran, Feb 28 '14 at 05:01
`headline` will be a [Nokogiri::XML::Node](http://nokogiri.org/Nokogiri/XML/Node.html) object, and they don't know what `click` is. — the Tin Man, Mar 04 '14 at 23:18
@ThomasTran has the right idea. The only way to retrieve the content of the URL in `link` is to `open` it, then parse it to do things with its content. — the Tin Man, Mar 04 '14 at 23:20
