
Before writing my own validator to check whether a URL actually points to an RSS feed, I searched for existing validators but had little luck finding a reliable one.

Does anyone in the community know of an RSS validator that works by URL?

If I were to write my own, what do you suggest?

I was thinking of checking that the first line of the document is <?xml version="1.0" encoding="UTF-8"?> and then perhaps checking that the next item is an <rss> node.

What are your thoughts here? Could there ever be a case where a feed may not follow the syntax stated above?

Also note, one method I attempted to use was the following:

$valid = true;

try{
    $content = file_get_contents($feed);
    if (!simplexml_load_string($content)){
        $valid = false;
    }
} catch (Exception $e){
    $valid = false;
}

Unfortunately, it seems that I cannot suppress the warnings (error_reporting(0) is not working), so it just spams me with warnings.


SOLUTION

For anyone who is interested, I used the W3C Validator API:

$url = "http://feed_url.com";
$validator = "http://validator.w3.org/feed/check.cgi";
// URL-encode the feed address so any query string it carries survives intact
$validator .= "?url=" . urlencode($url);
$validator .= "&output=soap12";

$response = file_get_contents($validator);
// Extract the text between <m:validity> and </m:validity> (the opening tag is 12 characters long)
$a = strpos($response, '<m:validity>', 0) + 12;
$b = strpos($response, '</m:validity>', $a);
$result = substr($response, $a, $b - $a);
echo $result;

This will print true or false accordingly.

Atticus
  • [Take a look in the manual](http://www.php.net/manual/en/function.simplexml-load-string.php), there is an option parameter for stuff like `LIBXML_NOERROR`, `LIBXML_NOWARNING` ([ref](http://www.php.net/manual/en/libxml.constants.php)). Have a nice read, your problem might just disappear then. – hakre Sep 27 '11 at 17:22
  • @Hakre, thanks! want to post this as an answer so I can give you credit? – Atticus Sep 27 '11 at 17:23
  • possible duplicate of [simplexml error handling php](http://stackoverflow.com/questions/1307275/simplexml-error-handling-php) – hakre Sep 27 '11 at 17:24
  • I bet this question has already been posted and an answer as well. Better delete the question and use the search next time - as long as you're only concerned about the errors. I mean this is not answering the RSS feed validator question probably. But check the related line-up on the right as well, like http://stackoverflow.com/questions/451338/validating-an-rss-feed and similar. – hakre Sep 27 '11 at 17:25
  • @Atticus Concerning your attempted method, just because a string is valid XML doesn't mean it's a valid RSS one. I would suggest you use [SimplePie](http://simplepie.org/) to handle this. If the [initialization](http://simplepie.org/wiki/reference/simplepie/init) of the object returns true you have a valid feed. – Shef Sep 27 '11 at 17:29
  • @Shef good point, I had that feeling too which is why I wanted to ensure it was an rss by checking the next item in the set to be an RSS node. However I'm not sure if there could be any other meta type of data before the RSS feed so I wanted to check – Atticus Sep 27 '11 at 17:32
  • @Atticus Do not try to handle it that way. Use SimplePie, as suggested on the last part of my previous comment. SimplePie is mature enough to account for all this. Maybe you can even extract the validation functions they use for the feeds, if the licence allows it. – Shef Sep 27 '11 at 17:36
  • @Shef Awesome resource, just checked it out. Thanks I'll look into this. +1 – Atticus Sep 27 '11 at 17:38
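
The libxml options mentioned in the comments above can be sketched roughly as follows. This is a minimal illustration rather than a definitive fix: `isWellFormedXml()` is a hypothetical helper name, and it only checks that the content is well-formed XML, not that it is a feed. It also compares against `false` explicitly, because a parsed but empty root element can evaluate to false in a boolean test such as `!simplexml_load_string($content)`.

```php
<?php
// Sketch: check well-formedness without libxml warnings being emitted.
// isWellFormedXml() is a hypothetical helper, not part of PHP itself.
function isWellFormedXml(string $content): bool
{
    // Nothing to parse, so treat blank input as invalid up front.
    if (trim($content) === '') {
        return false;
    }

    // LIBXML_NOERROR and LIBXML_NOWARNING suppress the parser's error
    // and warning reports for this call only.
    $xml = simplexml_load_string(
        $content,
        'SimpleXMLElement',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );

    // Strict comparison: a valid but empty element is falsy, yet it is not false.
    return $xml !== false;
}
```

Note that `file_get_contents()` can raise its own warning when the URL cannot be fetched, and these libxml options do not cover that call.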

2 Answers


The W3C Feed Validation Service offers a SOAP interface. From the About page:

> Is there a Web Service with a public API for this service?
>
> Yes, there is a SOAP interface, accessible by using the query parameter output="soap12" on top of a regular query. The SOAP 1.2 Web Service API documentation has more details.
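
As a rough sketch of how that query parameter might be used from PHP, the helpers below build the request URL and read the validity flag out of a response string. Both `buildValidatorUrl()` and `extractValidity()` are hypothetical names. The endpoint and the `<m:validity>` element are taken from the solution posted in the question, and the sketch assumes the response keeps using the `m:` prefix.

```php
<?php
// Sketch only: both helpers are hypothetical and not part of any library.

// Build the validator request for a feed address, asking for SOAP 1.2 output.
function buildValidatorUrl(string $feedUrl): string
{
    // http_build_query() URL-encodes the feed address.
    return 'http://validator.w3.org/feed/check.cgi?' . http_build_query(
        ['url' => $feedUrl, 'output' => 'soap12'],
        '',
        '&'
    );
}

// Return true or false as reported by the service,
// or null when no <m:validity> element is found in the response.
function extractValidity(string $soapResponse): ?bool
{
    $pattern = '~<m:validity>\s*(true|false)\s*</m:validity>~i';
    if (!preg_match($pattern, $soapResponse, $matches)) {
        return null;
    }
    return strtolower($matches[1]) === 'true';
}
```

A caller would fetch `buildValidatorUrl($feed)` with `file_get_contents()` and pass the body to `extractValidity()`. Returning null for a missing element keeps "the service did not answer as expected" separate from "the feed is invalid".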

Colin Brock

I would do this:

  1. Is it valid XML? If so, continue.

  2. Is the top-level element either rss or feed? If so, it's a feed. If not, it's not.

That covers every version of RSS except 1.0, as well as every version of Atom.

RSS 1.0 is more difficult since its top-level element is rdf:RDF, and RDF is a more generic format than RSS, so you'd have to look deeper for indications of RSS-ness. Luckily, there's not much RSS 1.0 out there these days; most feeds are RSS 2.0 or Atom 1.0.
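
The two steps above, plus one possible way of "looking deeper" for RSS 1.0, might be sketched like this. `looksLikeFeed()` is a hypothetical helper name, and treating an `rdf:RDF` document as a feed when it contains a `channel` element in the RSS 1.0 namespace (`http://purl.org/rss/1.0/`) is an assumption about what counts as RSS-ness, not something this answer prescribes. It is a heuristic, not a validator.

```php
<?php
// Sketch of the two-step check; looksLikeFeed() is a hypothetical helper.
function looksLikeFeed(string $content): bool
{
    if (trim($content) === '') {
        return false;
    }

    // Step 1: is it well-formed XML? Keep libxml quiet while parsing.
    $xml = simplexml_load_string(
        $content,
        'SimpleXMLElement',
        LIBXML_NOERROR | LIBXML_NOWARNING
    );
    if ($xml === false) {
        return false;
    }

    // Step 2: look at the top-level element's name (namespace prefix excluded).
    $root = $xml->getName();
    if ($root === 'rss' || $root === 'feed') {
        return true; // RSS 0.9x/2.0 or Atom
    }

    // RSS 1.0 (assumption): an rdf:RDF root holding a <channel> element
    // that lives in the RSS 1.0 namespace.
    if ($root === 'RDF') {
        $rss10 = $xml->children('http://purl.org/rss/1.0/');
        return count($rss10->channel) > 0;
    }

    return false;
}
```

Matching on the element name alone accepts any document whose root happens to be called `rss` or `feed`, so a stricter check could also compare the root's namespace or hand the content to a full feed parser.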

Hope this helps, with the usual disclaimers, I am not a lawyer, etc.

Dave Winer
  • Thanks Dave, this is what I was worried about: the top-level elements being different among RSS versions or Atom. Good tip, I think I'll use an existing API to avoid any exceptions like these that I may not have known about. +1 :) – Atticus Sep 27 '11 at 17:44