I am retrieving the latest news articles from the cnn.com website and wrote a simple Nokogiri script to do this:
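require 'nokogiri'
require 'open-uri'

url = "http://edition.cnn.com/?refresh=1"
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(url))
puts doc.at_css("title").text
# Print the text of every link matched by the headline selector
doc.css("#cnn_maintt2bul div+ div a").each do |headline|
  article = headline.text
  puts article
end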
The problem is that CNN posts a mixture of articles and links to videos, and I am only interested in the articles, not the videos. When I run this script, it retrieves all the articles but leaves a blank space wherever an article links to a video, for example:
Pakistan airstrikes kill dozens
Could U.S. leave Afghanistan?
Editor's stabbing draws outrage
Ukrainian city fears uprising
U.S. hate groups in decline
This would mean that "Ukrainian city fears uprising" would actually link to a video. It would do this until it retrieves the last article.
I discovered that the articles linking to videos have a selector called .cnnVideoIcon. Any ideas about how I could use this so that articles linking to videos are removed from my results?
How would I eliminate such links while I am parsing? They could appear anywhere in the list.