
I am trying to scrape the five latest stories from CNN.com and retrieve their links along with the first paragraph of each story. I have this simple script:

url = "http://edition.cnn.com/?refresh=1"
agent = Mechanize.new
agent.get("http://edition.cnn.com/?refresh=1").search("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline| 
 article = headline.text
 link = URI.join(url, headline[:href]).to_s
 page = headline.click(link)
 paragraph1 = page.at_css(".adtag15090+ p").text
 puts "#{article}"
 puts "#{link}"
 puts "#{paragraph1}"
 puts "\n"
end

This code doesn't work because the click method is not recognized. It raises this error:

cnn_scraper.rb:10:in `block in <main>': undefined method `click' for #<Nokogiri::XML::Element:0x2c49b40> (NoMethodError)

The first paragraph of every article on CNN.com matches the selector .adtag15090+ p. Also note that the script parses every article, whereas I want only five. How can I get the first five stories and their first paragraphs using Nokogiri and Mechanize?

the Tin Man
Wasswa Samuel
  • Why don't you make another Nokogiri call to parse the content of the headline page? Example: `doc1 = Nokogiri::HTML(open(link))` – Thomas Tran Feb 28 '14 at 05:01
  • `headline` will be a [Nokogiri::XML::Node](http://nokogiri.org/Nokogiri/XML/Node.html) object, and such objects don't know what `click` is. – the Tin Man Mar 04 '14 at 23:18
  • @ThomasTran has the right idea. The only way to retrieve the content of the URL in `link` is to `open` it, then parse it to do things with its content. – the Tin Man Mar 04 '14 at 23:20
