Problem with charset and rome (rss/atom feeds)

Question

I'm trying to create a feed aggregator using rome (1.0). Everything works, but I'm having problems with the feeds' charset. I'm developing it with Java 1.6 on Mac OS X (NetBeans 6.9.1).

I'm using the following code to retrieve feeds:

InputStream is = new URL(_source).openConnection().getInputStream();
SyndFeed feed = (SyndFeed) input.build(new InputStreamReader(is, Charset.forName(_charset)));

Here _source is an RSS source (like http://rss.cnn.com/rss/edition.rss) and _charset is UTF-8 or ISO-8859-1.

It works, but for some sites with Latin characters (Portuguese, for example) it doesn't, whichever of the two encodings I use.

For instance, feeds read from http://oglobo.globo.com/rss/plantaopais.xml always come back with garbled characters, as follows:

Secret�rio de S�o Paulo (UTF-8)
SecretÃ¡rio de SÃ£o Paulo (ISO-8859-1)

Why? Am I missing something?

If I try to use something like UTF-16, rome throws an error: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.

I've tried other encodings, like US-ASCII, with no luck.

Another question: is rome the best solution for dealing with feeds in Java? The most recent version of rome is 1.0, which dates from 2009. It seems to be dead.

TIA,

Bob

This is related to http://stackoverflow.com/questions/8473410/while-parsing-rss-feed-through-rome-getting-content-is-not-allowed-in-prolog/14557915#14557915. There is no longer content on the feed in the question so I was unable to test if it was due to a byte ordering problem. — Mark Butler, Jan 28 '13 at 08:12

Paŭlo Ebermann · Accepted Answer · 2011-03-20T22:14:09.003

I don't know rome (you could have put a link in your question). ISO-8859-1 should be the right encoding to use for the feed you linked. But doesn't your library support an InputStream as a source (so that it would itself look up the right encoding from the XML declaration)?
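For reference, the two garbled strings in the question are the patterns one would expect when bytes are decoded with the wrong charset. Here is a minimal sketch using only the JDK (not rome). The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It does not tell you which step of the pipeline did the wrong decoding, only what each kind of mismatch looks like.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of rome.
class MojibakeDemo {

    // Encode the text with one charset, then decode the bytes with another.
    // This mimics wrapping a stream in a Reader that uses the wrong charset.
    static String recode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String original = "Secret\u00e1rio de S\u00e3o Paulo";

        // ISO-8859-1 bytes decoded as UTF-8: each accented letter is an
        // invalid byte sequence and is replaced by U+FFFD.
        System.out.println(recode(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));

        // UTF-8 bytes decoded as ISO-8859-1: each accented letter turns
        // into two Latin-1 characters.
        System.out.println(recode(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Whether the result displays correctly still depends on the console, so comparing the returned strings in code is more reliable than reading them off the terminal.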

Could it be that the text only gets garbled after processing, when your program prints it? Could you write

System.out.println("S\u00e3o Paulo");

in your program and report its output? (It should be "São Paulo" if your Java + console combination is configured right.)
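If that one-line check is inconclusive, a slightly longer sketch can show which bytes Java would actually send to the console. This is only an illustration: the class name and the toHex helper are hypothetical, and it assumes nothing beyond the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for inspecting console encoding.
class ConsoleCheck {

    // Render the bytes of the text in the given charset as space-separated hex.
    static String toHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "S\u00e3o Paulo";
        Charset platform = Charset.defaultCharset();

        System.out.println("default charset: " + platform);
        System.out.println("printed text:    " + sample);
        // A 3f ('?') in place of the accented letter would mean the
        // default charset cannot represent it.
        System.out.println("bytes:           " + toHex(sample, platform));
    }
}
```

If the hex looks right for the default charset but the text on screen is wrong, the terminal is probably interpreting the bytes with a different encoding than Java is writing.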

So, I have now downloaded and compiled Rome (which took half an hour while Maven downloaded other stuff), and I can reproduce the problem. It looks like the build method that takes a Reader has problems.

Here is a variant that works (if rome, jdom and xerces are in the class path):

package de.fencing_game.paul.examples.rome;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.xml.sax.InputSource;

import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.FeedException;
import com.sun.syndication.io.SyndFeedInput;

public class RomeTest {

    public static void main(String[] ignored)
        throws IOException, FeedException
    {
        String url = "http://oglobo.globo.com/rss/plantaopais.xml";

        // Hand the raw bytes to the parser; no charset is chosen here.
        InputStream is = new URL(url).openConnection().getInputStream();
        try {
            // With a byte stream, the XML parser reads the encoding
            // from the XML declaration itself.
            InputSource source = new InputSource(is);

            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(source);

            System.out.println("description: " + feed.getDescription());
        } finally {
            is.close();
        }
    }
}

By using an InputSource with an InputStream instead of a Reader, the parser itself finds out the right charset, and gets it right.
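To see the difference between the two kinds of input without the network or rome involved, here is a small sketch that uses the JDK's built-in DOM parser instead. It is an illustration under assumptions, not a reproduction of the rome behavior: the class name, the helper methods and the one-element sample document are all made up, and it assumes Java 7 or later for StandardCharsets.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK parser, not rome.
class EncodingDetectionDemo {

    // A tiny stand-in for a feed: ISO-8859-1 bytes that declare ISO-8859-1.
    static byte[] sampleFeed() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<title>S\u00e3o Paulo</title>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parse the source and return the text of the root element.
    static String parse(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    // Byte stream: the parser picks the charset from the XML declaration.
    static String titleFromStream(byte[] xml) {
        return parse(new InputSource(new ByteArrayInputStream(xml)));
    }

    // Character stream: the bytes are decoded with the charset we assume,
    // and the parser no longer consults the declared encoding.
    static String titleFromReader(byte[] xml, Charset assumed) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), assumed);
        return parse(new InputSource(reader));
    }

    public static void main(String[] args) {
        byte[] xml = sampleFeed();
        System.out.println("stream:          " + titleFromStream(xml));
        System.out.println("reader (UTF-8):  " + titleFromReader(xml, StandardCharsets.UTF_8));
    }
}
```

With the byte stream the title comes back intact, while the Reader built with the wrong charset has already replaced the accented letter before the parser sees it. That is why letting the parser read the bytes directly is the more robust choice when the feed's encoding is not known in advance.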

Digging around a bit in the source, it seems SyndFeedInput passes the Reader or InputSource to JDOM, which in turn passes it to the SAX XMLReader, which seems to get confused when given a Reader whose content starts with <?xml ... encoding="ISO-8859-1" ?>. I then dug around in the source of Xerces (which seems to be the parser used here), but didn't find anything suspicious that would cause this.

Thanks for your answer. I provided the link to rome project. I tried both ways, with and without defining the encoding for InputStream. The results were the same (without specifying the result was the same using UTF-8). I did the test and it worked. It printed São Paulo correctly. — Bob Rivers, Mar 20 '11 at 19:12
Thank you very much. It worked nicely when implement the way you suggested. — Bob Rivers, Mar 21 '11 at 01:00
@Bob: This has the other benefit of adjusting itself to any changes of the source encoding. — Paŭlo Ebermann, Mar 21 '11 at 01:07
@PaŭloEbermann thanks very much! I ran into this issue using the Restlet library's ROME extension (I'll be submitting a fix proposal to the Restlet folks). You saved me a ton of time. — Andy Dennie, Mar 09 '12 at 16:21
