Malformed Nokogiri parse from XML file accessed from URL?

Question

I'm trying to parse an XML document provided by the BBC. However, when I do a simple check of what Ruby actually receives, a lot of detail appears to be missing.

require 'open-uri'
require 'nokogiri'

class MainController < ApplicationController

  def index
    @xml = Nokogiri::XML(open("http://www.bbc.co.uk/bbcone/programmes/schedules/scotland/2013/12/13.xml"))

    render :text => @xml
  end
end

All that I get from the output, truncated for size, is a heap of incoherent text:

 p01ml65v 2013-12-13T00:20:00Z 2013-12-13T00:25:00Z 300 b03ktclr Detailed weather forecast. audio_video 300 p01lc1h3 Skiing Weatherview 2013-12-13T00:20:00Z b007yy70 2007-09-02T01:50:00+01:00 0 0 p01ml65w 2013-12-13T00:25:00Z 2013-12-13T06:00:00Z 20100 b03ktclt BBC One joins the BBC's rolling news channel for a night of news. audio_video 20100 p01m1rbq 13/12/2013 2013-12-13T00:25:00Z b00h9fxh 2006-04-05T00:20:00+01:00 0 0 p01ml966 2013-12-13T06:00:00Z 2013-12-13T09:15:00Z 11700 b03ktcn1

It's also missing quite a lot of children. Can you shed some light on how I might approach this issue?

The end goal for now is just to display the title of each show, found at the node path /schedule/day/broadcasts/broadcast/programme/display_titles/title. The rest will follow once that works.

You need to provide a summarized, simplified sample of the XML *in your question*. Don't expect us to chase down the XML; we're volunteering our time as it is, and looking it up wastes our time. Also, *when* that link breaks your question will make no sense. It'd also help you to read how to use [Nokogiri](http://nokogiri.org). The tutorials will help you a great deal. "Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See http://SSCCE.org for guidance." — the Tin Man, Dec 13 '13 at 19:29
It looks like you're getting the text content of the XML nodes. Either that or when you're viewing the result in your browser, it's trying to interpret the XML as HTML. Try viewing source to see what you're actually getting. If you want to be able to see the XML of what you have, try `render :xml =>` instead to get the right content type, and look at Nokogiri's documentation on how to get the full XML instead of the text content. — carols10cents, Dec 14 '13 at 15:45

score 0 · Accepted Answer · edited May 23 '17 at 12:09

I'm not going to hand you an answer, because it doesn't look like you tried reading Nokogiri's documentation.

What I will do is point you in the general direction:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open.
doc = Nokogiri::XML(URI.open("http://www.bbc.co.uk/bbcone/programmes/schedules/scotland/2013/12/13.xml"))

episode = doc.at('programme[type="episode"]')
episode.at('title').text # => "Skiing Weatherview"
episode.at('short_synopsis').text # => "Detailed weather forecast."

doc.search('broadcast').size # => 32
doc.search('title').map(&:text).uniq.sort
# => ["13/12/2013",
#     "14/12/2013",
#     "A Question of Sport",
...

Parsing the document into a DOM isn't sufficient on its own. You still need to retrieve the nodes you want. You can do that with `at`, which returns the first matching node, or `search`, which returns all matching nodes.

See "How to avoid joining all text from Nodes when scraping" also.
