5

I'm trying to determine whether a given feed is Atom based or RSS based.

Here's my code:

public boolean isRSS(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder();
    Document doc = builder
            .parse(URL);
    return doc.getDocumentElement().getNodeName() == "rss";
}

Is there a better way to do it? Would it be better if I used a SAX parser instead?

Dan Lowe
Mahmoud Hanafy

3 Answers

4

The root element is the easiest way to determine the type of a feed.

Different parsers offer different ways to get the root element, and none is inherently inferior to the others. Plenty has been written about StAX vs. SAX vs. DOM, which can serve as the basis for a specific decision.

There is nothing wrong with your first two lines of code:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(URL);

In your return statement, however, you make a mistake in Java String comparison.

When you use the comparison operator == with Strings, it compares references, not values (i.e. it checks whether both are exactly the same object). You should use the equals() method here. To be safe, I would recommend using equalsIgnoreCase():

return doc.getDocumentElement().getNodeName().equalsIgnoreCase("rss");
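
To make the difference concrete, here is a minimal sketch of reference versus value comparison. The class and method names are made up for illustration, and `new String("rss")` only simulates a node name built at runtime; whether a real parser returns the same object as a literal is implementation-dependent.

```java
final class StringComparisonDemo {
    // Reference comparison: true only if both operands are the very same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Value comparison that ignores case.
    static boolean sameValueIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Simulates a name produced at runtime (assumption for this demo):
    // a distinct String object whose characters are "rss".
    static String runtimeName() {
        return new String("rss");
    }
}
```

With these helpers, `sameReference(runtimeName(), "rss")` is false even though the characters match, while `sameValueIgnoringCase(runtimeName(), "RSS")` is true.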

Hint: If you check for "rss" instead of "feed" (as for Atom) in your isRSS() method, you don't have to use the ternary operator.

Chris
  • Yeah, I know I don't have to, I wrote the question when I was really sleepy, sorry about that. – Mahmoud Hanafy Oct 01 '11 at 22:25
  • @MahmoudHossam No problem, but your updated return statement (return !(doc.getDocumentElement().getNodeName() == "feed");) also won't work because of the described comparison problem. – Chris Oct 01 '11 at 22:34
  • After a few hours of looking around on how to normalize / create a general way to parse the differing rss feed formats - this is what I came up with as well. Good answer. – Edward Jan 01 '18 at 05:13
3

Sniffing the content is one method. But note that Atom uses namespaces, and you are creating a parser that is not namespace-aware.

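public boolean isAtom(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true);  // required for getLocalName() and getNamespaceURI()
    // Build from the configured factory; calling newInstance() again would discard the setting
    DocumentBuilder builder = f.newDocumentBuilder();
    Document doc = builder.parse(URL);
    Element e = doc.getDocumentElement();
    // Constant-first equals() avoids a NullPointerException when the root has no namespace
    return "feed".equals(e.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(e.getNamespaceURI());
}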

Note also that you should not compare using equalsIgnoreCase(), since XML element names are case-sensitive.
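
A namespace-aware, case-sensitive check for both formats could look roughly like the sketch below. It reads from an InputStream rather than a URL so it can be tried without network access. The `FeedSniffer` class, the `FeedType` enum and the choice to report unparseable input as UNKNOWN are assumptions of this example, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

final class FeedSniffer {
    enum FeedType { ATOM, RSS, UNKNOWN }

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Classifies a feed by its root element, using a namespace-aware DOM parser.
    static FeedType detect(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(in);
            Element root = doc.getDocumentElement();
            // Case-sensitive comparisons: XML names are case-sensitive.
            if ("feed".equals(root.getLocalName()) && ATOM_NS.equals(root.getNamespaceURI())) {
                return FeedType.ATOM;
            }
            if ("rss".equals(root.getLocalName())) {
                return FeedType.RSS;
            }
            return FeedType.UNKNOWN;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Design choice of this sketch: anything unparseable counts as UNKNOWN.
            return FeedType.UNKNOWN;
        }
    }

    // Convenience overload for XML held in a String.
    static FeedType detectFromString(String xml) {
        return detect(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that a root element named feed without the Atom namespace is reported as UNKNOWN here, which is the strict reading of the namespace rule.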

Another method is to react to the Content-Type header, if it is available in the response to an HTTP GET request. The Content-Type for Atom would be application/atom+xml and for RSS application/rss+xml. I suspect, though, that not all RSS feeds can be trusted to set this header correctly.
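
As a rough sketch of the header-based check, the helper below classifies a Content-Type value that has already been obtained, for example from `URLConnection.getContentType()`. The class name, the method name and the "atom" / "rss" / "unknown" return values are invented for this example.

```java
import java.util.Locale;

final class ContentTypeSniffer {
    // Maps a Content-Type header value to "atom", "rss" or "unknown".
    static String feedTypeFromContentType(String contentType) {
        if (contentType == null) {
            return "unknown";  // header missing
        }
        // Drop parameters such as "; charset=utf-8".
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        // Media type names are case-insensitive, so normalise before comparing.
        mediaType = mediaType.trim().toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "application/atom+xml":
                return "atom";
            case "application/rss+xml":
                return "rss";
            default:
                return "unknown";  // e.g. text/xml, which many feeds send
        }
    }
}
```

Because many servers send a generic type such as text/xml, an "unknown" result is best treated as a cue to fall back to inspecting the root element.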

A third option is to look at the URL suffix, e.g. .atom and .rss.
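
A suffix check might be sketched like this. The class and method names are hypothetical, and matching the suffix without regard to case is a leniency chosen for this example, not a rule.

```java
import java.net.URI;
import java.util.Locale;

final class UrlSuffixSniffer {
    // Guesses the feed type from the path suffix of a URL: "atom", "rss" or "unknown".
    static String feedTypeFromUrl(String url) {
        String path;
        try {
            // getPath() excludes the query string and fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return "unknown";  // not a syntactically valid URI
        }
        if (path == null) {
            return "unknown";  // opaque URIs have no path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".atom")) {
            return "atom";
        }
        if (lower.endsWith(".rss")) {
            return "rss";
        }
        return "unknown";
    }
}
```

Many feed URLs carry no such suffix at all, so this is only a hint to combine with the other checks.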

The last two methods are easily configurable if you are using Spring or JAX-RS.

forty-two
  • I'd like your approach in a perfect world. :) In my experience you will have to cope with a whole bunch of in-the-wild-feeds ignoring standards like Content-Type, suffixes or case of XML elements. That's why I suggested an equalsIgnoreCase()-check of the root element, since that's almost always correct. – Chris Oct 02 '11 at 22:39
  • @Chris. I give you that the world is non perfect and the feed business is chaotic. Just look at the [ROME](http://java.net/projects/rome/) source code. But, at least use a name space aware XML parser, please! – forty-two Oct 02 '11 at 23:02
  • I think I can use both methods, one checks for RSS, the other for Atom. – Mahmoud Hanafy Oct 05 '11 at 05:03
2

You could use a StAX parser to avoid parsing the entire XML document into memory:

public boolean isAtom(String url) throws XMLStreamException, IOException {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // createXMLStreamReader() accepts an InputStream, not a URLConnection
    try (InputStream in = new URL(url).openStream()) {
        XMLStreamReader xsr = xif.createXMLStreamReader(in);
        xsr.nextTag();  // Advance to root element
        return "feed".equals(xsr.getLocalName()) &&
                "http://www.w3.org/2005/Atom".equals(xsr.getNamespaceURI());
    }
}
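
For experimenting without a network connection, a variant of the same StAX idea can read from any InputStream. This sketch loops with next() until the first start element, so that whatever precedes the root element is simply skipped. The `StaxFeedSniffer` class and the choice to return false on malformed input are assumptions of this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxFeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Returns true if the first start element is {http://www.w3.org/2005/Atom}feed.
    static boolean isAtom(InputStream in) {
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the root element is inspected; the rest is never read.
                        return "feed".equals(reader.getLocalName())
                                && ATOM_NS.equals(reader.getNamespaceURI());
                    }
                }
                return false;  // no element found
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return false;  // design choice of this sketch: malformed input is not Atom
        }
    }

    // Convenience overload for XML held in a String.
    static boolean isAtomXml(String xml) {
        return isAtom(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```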
bdoughan
  • I'm going to be using this in an Android application, so I'm not sure if Android has a StAX parser built in, and I don't want to add extra dependencies since I'm going to add a library for each feed type already. – Mahmoud Hanafy Oct 05 '11 at 05:05
  • @MahmoudHossam - Android has `XmlPullParser` which is its own version of a StAX parser: http://developer.android.com/reference/org/xmlpull/v1/XmlPullParser.html – bdoughan Oct 05 '11 at 10:56