You're dealing with RSS/Atom feeds, which can contain multiple title tags. You need to iterate over all title nodes and extract their content separately, in a way that lets you keep track of their order and what article they're attached to:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
</feed>
EOT
doc.search('title').map(&:text)
# => ["\n First Post! \n "]
This returns an array of the text inside the title nodes. From there you can easily clean up each string, manipulate it, reuse it, whatever:
doc.search('title').map{ |s| s.text.strip }
# => ["First Post!"]
search returns a NodeSet, which is akin to an array of the title nodes found in the document. If you don't iterate over them, you'll get a single concatenated string containing all their text, which is usually NOT what you want:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<foo>
<title>this</title>
<title>is</title>
<title>what</title>
<title>you'd</title>
<title>get</title>
</foo>
EOT
doc.search('title').text
# => "thisiswhatyou'dget"
versus:
doc.search('title').map(&:text)
# => ["this", "is", "what", "you'd", "get"]
Trying to tear apart the first result is impossible unless you have prior knowledge of the document's structure, which you usually don't. Iterating over the returned NodeSet yields far more usable results.
To keep each title associated with its entry, you need to loop over the entries, then extract the title embedded in each one, which is a bit different from what your sample XML and code show:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
<entry>
<title type="html">
<![CDATA[ Second Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>blah</p>]]>
</content>
</entry>
</feed>
EOT
titles = doc.search('entry').map { |entry|
  entry.at('title').text.strip
}
titles # => ["First Post!", "Second Post!"]
Or perhaps more usable:
titles_and_content = doc.search('entry').map { |entry|
  [
    entry.at('title').text.strip,
    entry.at('content').text.strip
  ]
}
titles_and_content
# => [["First Post!",
#      "<p>I’m very excited to have finally got my site up and running along with this blog!</p>"],
#     ["Second Post!", "<p>blah</p>"]]
That returns the title and the content for each entry. From this you can easily build up code to extract the links to the articles, the publication dates, refresh rates, the original site, and everything else you'd want to know about an individual article and its source, then store it in a database for later regurgitation when requested.
There are gems and scripts available for processing RDF, RSS and Atom feeds. Years ago, when I had to write a huge aggregator for feeds, nothing available met my needs, so I wrote one from scratch. I'd recommend trying to find an existing one rather than reinventing that wheel; failing that, look through their source and learn from their experience. There are a number of things your code has to do to be a good network citizen that doesn't swamp the servers and get you banned.
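One commonly used courtesy, offered here as an assumption about what being a good network citizen can involve rather than a description of my aggregator, is the HTTP conditional GET: send back the ETag and Last-Modified values from the previous fetch so the server can reply "304 Not Modified" instead of resending the whole feed. The helper name build_feed_request, the User-Agent string, and the URL are all hypothetical:

```ruby
require 'net/http'
require 'uri'
require 'time'

# Build a GET request that lets the server answer "304 Not Modified"
# when the feed has not changed since the previous fetch.
# (Hypothetical helper; nothing is sent over the network here.)
def build_feed_request(url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(url))
  # Identify the client so a site owner can contact you instead of banning you.
  request['User-Agent'] = 'ExampleAggregator/0.1 (+http://example.com/bot)'
  request['If-None-Match'] = etag if etag
  request['If-Modified-Since'] = last_modified.httpdate if last_modified
  request
end

request = build_feed_request(
  'http://example.com/feed.atom',
  etag: '"abc123"',
  last_modified: Time.utc(2013, 1, 1)
)
```

You would send the request with Net::HTTP and skip parsing whenever the response is a Net::HTTPNotModified, storing the new ETag and Last-Modified headers otherwise.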
See also "How to avoid joining all text from Nodes when scraping".