
I'm creating a Ruby on Rails application and using Nokogiri to parse an XML file. I'm trying to parse the XML file into mutable strings which I can manipulate to create other content.

Here's a sample of the XML I'm using:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html">
      <![CDATA[ First Post! ]]>
    </title>
    <content type="html">
      <![CDATA[
        <p>I&rsquo;m very excited to have finally got my site up and running along with this blog!</p>]]>
    </content>
  </entry>
</feed>

This is what I've done so far.

In my controller:

def index
    @blog_title, @blog_post = parse_xml
end

private
def parse_xml
    @xml_doc = Nokogiri::XML(open("atom.xml"))
    titles = @xml_doc.css("entry title")
    post = @xml_doc.css("content")
    return titles, post
end

In my view:

<% for i in 1..@blog_title.length %>
    <li><%= @blog_title[i-1] %></li>
    <li><%= @blog_post[i-1] %></li>
<% end %>

Sample output from the view (each item is rendered as a Nokogiri element rather than plain text):

<title type="html"><![CDATA[First Post!]]></title>

So ideally, I'd like to convert every Nokogiri::XML::Element in the Nokogiri::XML::Document to a string, or turn the whole result into an array of Strings.

I've tried iterating through each element and calling .to_s, but that returns the element's markup, tags included, rather than just its text.

I've also tried calling Ruby String methods such as slice directly on the elements, and that doesn't work (for obvious reasons, since they aren't Strings).

The end result I'm after (using the sample output from my view) is to return only the following and none of the surrounding markup:

First Post!

Can anyone help me? If I'm not clear enough or if someone needs to see more work, please feel free to ask!

Tony
  • Use `.text`; something like `titles.text` gives you what you want. – Yevgeniy Anfilofyev Jul 15 '15 at 09:10
  • I'm so embarrassed, thank you. I tried .text earlier, but I was calling it in the controller, which I now realize didn't change anything. So I just called it in my view, which is perfect :). If possible, can you post that as an answer so that I can mark it correct? – Tony Jul 15 '15 at 09:16

2 Answers


In your case, simply use .text to extract the text content of the tags. Something like titles.text would work.

Yevgeniy Anfilofyev
  • While it seems like `titles.text` will work, in reality it's going to return a concatenated string of all text inside all found `title` tags, which is seldom what we want. After retrieving that, then we have to figure out how to split the text into separate/understandable strings. – the Tin Man Jul 15 '15 at 17:49

You're dealing with RSS/Atom feeds, which can contain multiple title tags. You need to iterate over all the title nodes and extract their content separately, in a way that lets you keep track of their order and which article each one belongs to:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html">
      <![CDATA[ First Post! ]]>
    </title>
    <content type="html">
      <![CDATA[
        <p>I&rsquo;m very excited to have finally got my site up and running along with this blog!</p>]]>
    </content>
  </entry>
</feed>
EOT

doc.search('title').map(&:text)
# => ["\n       First Post! \n    "]

This returns an array of the text inside the title nodes. From there you can easily clean up each string, then manipulate or reuse it however you like.

doc.search('title').map { |node| node.text.strip }
# => ["First Post!"]

search returns a NodeSet, which is akin to an array of title nodes found in the document. If you don't iterate over them you'll get a concatenated string containing all their text, which is usually NOT what you want:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<foo>
  <title>this</title>
  <title>is</title>
  <title>what</title>
  <title>you'd</title>
  <title>get</title>
</foo>
EOT

doc.search('title').text
# => "thisiswhatyou'dget"

versus:

doc.search('title').map(&:text)
# => ["this", "is", "what", "you'd", "get"]

Tearing apart the first result is impossible unless you have prior knowledge of the document's structure, which you usually don't. Iterating over the returned NodeSet gives you one string per node, which is far more usable.

To keep each title associated with its entry, loop over the entry nodes and extract the title embedded in each one, which is a bit different from what your sample XML and code show:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="html">
      <![CDATA[ First Post! ]]>
    </title>
    <content type="html">
      <![CDATA[
        <p>I&rsquo;m very excited to have finally got my site up and running along with this blog!</p>]]>
    </content>
  </entry>
  <entry>
    <title type="html">
      <![CDATA[ Second Post! ]]>
    </title>
    <content type="html">
      <![CDATA[
        <p>blah</p>]]>
    </content>
  </entry>
</feed>
EOT

titles = doc.search('entry').map { |entry|
  entry.at('title').text.strip
}
titles # => ["First Post!", "Second Post!"]

Or perhaps more usable:

titles_and_content = doc.search('entry').map { |entry|
  [
    entry.at('title').text.strip,
    entry.at('content').text.strip
  ]
}
titles_and_content 
# => [["First Post!",
#      "<p>I&rsquo;m very excited to have finally got my site up and running along with this blog!</p>"],
#     ["Second Post!", "<p>blah</p>"]]

This returns the title and the content for each entry. From there you can easily build up code to extract the links to the articles, publication dates, refresh rates, the original site, and everything else you'd want to know about an individual article and its source, then store it in a database for later regurgitation when requested.

There are gems and scripts available for processing RDF, RSS and Atom feeds. Years ago, when I had to write a huge aggregator for feeds, nothing available met my needs, so I wrote one from scratch. I'd recommend finding an existing one rather than reinventing that wheel; failing that, read through their source and learn from their experience. A feed reader has to do a number of things in code to be a good network citizen, so that it doesn't swamp the servers and get you banned.

See also "How to avoid joining all text from Nodes when scraping".
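The answer does not say which courtesies it has in mind. One widely used one, offered here purely as an assumption, is the HTTP conditional GET: send back the `ETag` and `Last-Modified` values from the previous fetch so the server can reply "304 Not Modified" instead of resending the whole feed. The sketch below only builds such a request with Ruby's standard `net/http`; the helper name, URL and header values are hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request carrying validators from the
# previous fetch, so an unchanged feed can be answered with a 304.
def conditional_feed_request(url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(url))
  request['If-None-Match'] = etag if etag
  request['If-Modified-Since'] = last_modified if last_modified
  request
end

# Example values are made up; in practice they come from the ETag and
# Last-Modified headers of the last successful response.
request = conditional_feed_request(
  'http://example.com/atom.xml',
  etag: '"abc123"',
  last_modified: 'Wed, 15 Jul 2015 09:00:00 GMT'
)

puts request['If-None-Match']
puts request['If-Modified-Since']
```

When the request is actually sent (for example inside `Net::HTTP.start`), a response that is a `Net::HTTPNotModified` means the stored copy of the feed is still current and nothing needs to be re-parsed.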

the Tin Man