0

I am retrieving the latest news articles from the cnn.com website, and wrote a simple Nokogiri script to do this:

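require 'nokogiri'
require 'open-uri'

url = "http://edition.cnn.com/?refresh=1"
# URI.open is required on Ruby 3.0+; older versions also accept a bare open(url)
doc = Nokogiri::HTML(URI.open(url))
puts doc.at_css("title").text
doc.css("#cnn_maintt2bul div+ div a").each do |headline|
  article = headline.text
  puts article
end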

The problem is that CNN posts a mixture of articles and links to videos, and I am only interested in the articles. When I run this script it retrieves all the headlines but prints a blank line whenever a headline links to a video, for example:

Pakistan airstrikes kill dozens
Could U.S. leave Afghanistan?
Editor's stabbing draws outrage
Ukrainian city fears uprising

U.S. hate groups in decline

Here the blank line means that "Ukrainian city fears uprising" actually links to a video. The same thing happens for every video link until the last article is retrieved.

I discovered that the articles linking to videos carry a class called .cnnVideoIcon. Any ideas about how I could use it so that articles linking to videos are removed from my results?

How would I eliminate such links while I am parsing? They could appear anywhere.

the Tin Man
Wasswa Samuel

3 Answers

2

I looked at the HTML source code of the CNN site and found that the "li" tag of a video headline has four child elements, whereas the "li" tag of a text headline has only three.

<li class="c_hpbullet3" data-vr-contentbox=""> 
   <span class="cnnPreWOOL"></span> 
   <a href="/video/data/2.0/video/world/2014/02/25/ctw-ukraine-political-aftermath-ian-bremmer-intv.cnn.html?hpt=hp_t5">Ukrainian politics remain in flux</a> 
   <span class="cnnPostWOOL"></span> &nbsp;
   <a href="/video/data/2.0/video/world/2014/02/25/ctw-ukraine-political-aftermath-ian-bremmer-intv.cnn.html?hpt=hp_t5" target=""><img class="cnnVideoIcon" width="16" height="10" border="0" alt="Ukrainian politics remain in flux" src="http://i.cdn.turner.com/cnn/.e/img/3.0/global/icons/video_icon.gif"></a> 
</li>

So, we can use the XPath syntax below:

doc.xpath("//div[@id='cnn_maintt2bul']/div/div/ul/li[count(*)=3]/a").each do |headline|
  article = headline.text
  puts "#{article}"
end
the Tin Man
Thomas Tran
  • How would I get the links to each article such they are clickable and also retrieve the first paragraph of each article. – Wasswa Samuel Feb 26 '14 at 11:42
  • Could you help me out with this question http://stackoverflow.com/questions/22055544/getting-visiting-and-limiting-the-number-of-links-using-nokogiri-and-mechanize – Wasswa Samuel Feb 27 '14 at 11:13
0

You should use something other than CSS selectors to find the desired tags. Use search instead of css and give it an XPath that selects only the elements that don't have a link to a video as a child.

I will update the answer with a designated XPath when you provide a real URL to the site you want to fetch information from.

the Tin Man
Severin
0

If you look at the source code of the blocks you're scraping from http://edition.cnn.com/?refresh=1, you will notice that videos are a link with a video icon (and no text), like so:

<a href="/video/data/...">
   <img class="cnnVideoIcon" alt="Ukrainian city fears uprising" ... 
        height="10" width="16">
</a>

This explains why you get some empty lines.

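You could skip those links using a more refined selector. Note that a:empty is not suitable here: :empty matches only elements with no child nodes at all (not even text), so it matches neither the headline links nor the icon links. Instead, exclude the links that contain an image:

#cnn_maintt2bul div + div a:not(:has(img))

Using a:not(:has(img)), you will only retrieve links without an image inside, or, in other words, the links that contain description text only.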


Another (less optimized) approach is to simply skip the empty lines with an if statement:

doc.css("#cnn_maintt2bul div + div a").each do |headline|
  article = headline.text
  if article != ""
    puts article
...
the Tin Man
Fabrizio Calderan