Feed Encoding Problems Ruby 1.9

Question

I am trying to parse RSS/Atom feeds in my Rails app, but I ran into serious problems with non-ASCII characters, e.g. the German umlauts ÄÖÜ or ß. Some feeds in the wild use proper UTF-8, but others make me cry. The general problem is:

I must be able to parse any feed, whatever encoding it has. Losing characters is not an option (though that is my current state), because I do text and language analysis on the feed items.

What I use so far:

Feedzirra for fetching and parsing the feeds, which works well so far. I also "sanitize" the values I get from Feedzirra.
HTMLEntities (gem) for unescaping special characters, like "&Auml;", which means "Ä"
rCharDet19 gem, to figure out which encoding the feed might have
string.encode! to convert from whatever that is to UTF-8
Ruby 1.9.3 (latest) and Rails 3.2.8 on Ubuntu Linux 12.04

The problem is that I literally have no idea what I'm doing wrong.

  def self.sanitize_encoding_and_htmlentities str
    cd = CharDet.detect str
    s = str.encode(:invalid => :replace, :undef => :replace, :replace => '')
    coder = HTMLEntities.new
    coder.decode(s)
  end

This is my current sanitize method. As a sample feed I use

http://www.N24.de/2/index.rss

So far, the "special" characters get stripped completely. This is the only variant I found that runs without raising an invalid-byte error. I changed the encode call slightly, because I read in the Ruby docs that, when no encoding is given, encode transcodes to the app's Encoding.default_internal, which is UTF-8 in my case. The CharDet call is only there in case I need the detected encoding for later changes.
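
A minimal sketch of how an encode call with these options can silently drop umlauts. It assumes the string arrives tagged as ASCII-8BIT, which the question does not confirm, and it spells out UTF-8 as the target instead of relying on Encoding.default_internal. The helper names lossy_utf8 and relabel_utf8 are made up for illustration:

```ruby
# Hypothetical helper: the encode call from the sanitize method,
# with the target encoding given explicitly.
def lossy_utf8(str)
  str.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
end

# Hypothetical helper: keep the bytes and only change the encoding label.
def relabel_utf8(str)
  str.dup.force_encoding('UTF-8')
end

# "Ärger" as UTF-8 bytes, but tagged as binary (ASCII-8BIT).
binary = "\xC3\x84rger".dup.force_encoding('ASCII-8BIT')

# Bytes >= 0x80 have no mapping from ASCII-8BIT to UTF-8,
# so :undef => :replace swaps them for the empty string.
puts lossy_utf8(binary)    # umlaut is gone

# The bytes were valid UTF-8 all along; relabelling keeps the umlaut.
puts relabel_utf8(binary)
```

If the bytes are already UTF-8, relabelling is enough; transcoding only makes sense when the source encoding is known and differs from the target.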

I used the magic_encoding gem, so every file in my project should have the encoding comment on the first line. My database is sqlite3 with UTF-8.

As of 2012, is there anything I should look at? Did I do anything really wrong?

Thanks for help!

EDIT: The feeds may be RSS of any kind, Atom, and/or just invalid XML. The encoding may be UTF-8, something different, or the feed may declare "utf-8" while actually containing some windows-XXX stuff, and so on. I really need one solution that covers all of this.
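
One way to cope with a feed that declares UTF-8 but carries Windows-1252 bytes, sketched with the standard library only. The helper name to_utf8 and the Windows-1252 fallback are assumptions for illustration; neither Feedzirra nor rCharDet19 is known to provide this, and a different fallback may fit other feeds better:

```ruby
# Hypothetical helper, not part of any gem mentioned in this thread.
# Assumption: bytes that are not valid UTF-8 are Windows-1252.
def to_utf8(raw, fallback = 'Windows-1252')
  candidate = raw.dup.force_encoding('UTF-8')
  # Text in a legacy 8-bit encoding is rarely valid UTF-8 by accident,
  # so a valid string is taken at face value.
  return candidate if candidate.valid_encoding?

  # Otherwise reinterpret the bytes in the fallback encoding and transcode.
  raw.dup.force_encoding(fallback).encode(
    'UTF-8', :invalid => :replace, :undef => :replace, :replace => '?'
  )
end

utf8_bytes   = "\xC3\x84rger".dup.force_encoding('ASCII-8BIT') # UTF-8 bytes
cp1252_bytes = "\xC4rger".dup.force_encoding('ASCII-8BIT')     # Windows-1252 bytes

puts to_utf8(utf8_bytes)
puts to_utf8(cp1252_bytes)
```

The validity check is a heuristic, not detection: it cannot tell Windows-1252 from ISO-8859-15 or other single-byte encodings, so a detector's guess could be passed in as the fallback argument.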

Also, the fetching and parsing must be as fast as possible, which is why I picked Feedzirra.

My current idea is to fetch the feed content, replace every character in the "title" and "description" nodes with HTML entities where possible, use the encode! method to switch to UTF-8, and then unescape the HTML entities. After this, special characters should be kept, I think, but I can't get anything like this working at the moment. Might this be a good approach?

score 0 · Accepted Answer · answered Aug 30 '12 at 07:45

I finally found the main problem:

Feedzirra already returns UTF-8 when accessing entries and their attributes. But I used the sanitize method to access attributes, which returns ASCII-8BIT and weird characters escaped as HTML entities.

However, I kicked all the sanitizing and encoding stuff out of my code, and now it just works. It seems that Feedzirra has something built in to transcode the feeds if necessary.
