I'm trying to create a feed aggregator using ROME (1.0). Everything is working, but I'm having problems with the feeds' charset. I'm developing it with Java 1.6 on Mac OS X (NetBeans 6.9.1).
I'm using the following code to retrieve feeds:
SyndFeedInput input = new SyndFeedInput();
InputStream is = new URL(_source).openConnection().getInputStream();
// Decode the raw stream with the charset I choose by hand
SyndFeed feed = input.build(new InputStreamReader(is, Charset.forName(_charset)));
Where _source is an RSS source (like http://rss.cnn.com/rss/edition.rss) and _charset is UTF-8 or ISO-8859-1.
It works for most feeds, but for some sites with Latin characters (Portuguese, for example) it doesn't, no matter which of the two encodings I use.
For instance, feeds read from http://oglobo.globo.com/rss/plantaopais.xml always come back with garbage characters, as follows:
Secret�rio de S�o Paulo (UTF-8)
Secretário de São Paulo (ISO-8859-1)
Why? Am I missing something?
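To isolate the symptom from ROME and the network, here is a minimal sketch using only the JDK. It assumes (and this is only an assumption) that the bytes on the wire are ISO-8859-1; the class name CharsetDemo and the helper decode are made up for illustration. Decoding those bytes as UTF-8 replaces each accented letter with U+FFFD, which is the "�" shown above.

```java
import java.nio.charset.Charset;

class CharsetDemo {
    // Decodes raw bytes with the named charset.
    // Bytes that are invalid in that charset come out as U+FFFD.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumption: the server sends ISO-8859-1 bytes for this headline.
        String headline = "Secret\u00e1rio de S\u00e3o Paulo";
        byte[] raw = headline.getBytes(Charset.forName("ISO-8859-1"));

        // Wrong charset: the single bytes for the accented letters are not valid UTF-8.
        System.out.println(decode(raw, "UTF-8"));
        // Matching charset: the text round-trips unchanged.
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

If the sketch reproduces the same two lines, the reader charset has to match whatever the feed actually sends, rather than being picked by hand.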
If I try something like UTF-16, ROME throws an error: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.
I've tried other encodings, like US-ASCII, with no luck...
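The UTF-16 error above can probably be explained the same way. The sketch below is JDK-only and assumes the feed starts with a single-byte XML declaration; PrologDemo and firstChar are hypothetical names. When those bytes are decoded as UTF-16, every pair of bytes is merged into one character, so the text no longer starts with "<" and an XML parser would see stray content before the prolog.

```java
import java.nio.charset.Charset;

class PrologDemo {
    // Encodes the text as US-ASCII bytes, decodes them with the named
    // charset, and returns the first resulting character.
    static char firstChar(String asciiText, String charsetName) {
        byte[] raw = asciiText.getBytes(Charset.forName("US-ASCII"));
        return new String(raw, Charset.forName(charsetName)).charAt(0);
    }

    public static void main(String[] args) {
        // Assumption: a typical single-byte XML declaration at the start of the feed.
        String prolog = "<?xml version=\"1.0\"?>";

        // UTF-8 keeps the leading '<' intact.
        System.out.println(firstChar(prolog, "UTF-8") == '<');
        // UTF-16 fuses the bytes for '<' and '?' into one unrelated character.
        System.out.println(firstChar(prolog, "UTF-16") == '<');
    }
}
```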
Another question: is ROME the best solution for dealing with feeds in Java? The most recent version of ROME is 1.0, which dates from 2009. It seems to be dead...
TIA,
Bob