i am trying to parse rss/atom-feeds in my rails app, but i encountered some serious problems with non-ASCII characters, eg. the german umlauts ÄÖÜ or ß. Some feeds in the wild use proper UTF-8, but some others make me cry. The general Problem is:
I must be able to parse any Feeds, whatever encoding they might have. The "loss" of characters is not an option (though its my current status), because i do some text and language analysis with the feed-items.
What i use so far:
- FeedZirra for fetching and parsing the feeds, works well so far. I also "sanitize" the values i get from FeedZirra.
- HTMLEntities (gem) for unescaping special characters, like
"Ä"
which means "Ä" - rCharDet19 gem, to figure out which encoding the feed might have, and to:
- string.encode! to convert from whatever it is to utf-8
- Ruby 1.9.3 (lastest) and Rails 3.2.8 on Ubuntu Linux 12.04
The problem is, that i literally have no idea what i'm doing wrong.
def self.sanitize_encoding_and_htmlentities str
cd = CharDet.detect str
s = str.encode(:invalid => :replace, :undef => :replace, :replace => '')
coder = HTMLEntities.new
coder.decode(s)
end
This is my current sanitize method. As sample-feed i use
http://www.N24.de/2/index.rss
So far, the "special" characters got replaced completely. This is the only variant i found which just works without raising an error due to invalid byte stuff. I changed the encode method slightly, because i read in the ruby doc that without any encoding given, the encode method should "translate" to the given default_internal Encoding of the app, which is utf-8 in my case. CharDet stands there just for possible changes to anything related, might be useful.
I used the magic_encoding gem, so every file in my project should have the comment on the first line. My database is sqlite3 with utf-8.
As of 2012, is there anything i should look at? Did i make anything really wrong?
Thanks for help!
EDIT: The feeds may be rss of any kind, atom, and/or just invalid XML. The Encoding may be UTF-8, something different, or just says "utf-8" while its some windows-XXX stuff, and so on. I really need a solution for this alltogether.
Also the fetching/parsing must be as fast as possible, that's why i picked feedzirra.
My current Idea is to get the feedcontent, replace every char in the "title" and "description" nodes with htmlentities if possible, use the encode! method to switch to utf-8, and then unescape the htmlentities. After this, special characters should be keeped i think, but i can't get something like this working at the moment. Might this be a good approach?