With the help of the kind SO people, I successfully implemented a simple RSS downloader in Haskell. But one question remains: how do I fix the broken encoding of the feed item titles? Here is a minimal working example:

import Control.Monad
import Control.Applicative
import Network.HTTP
import Text.Feed.Import
import Text.Feed.Query
import Text.Feed.Types
import Data.Maybe
import qualified Data.ByteString as B
import Network.URI (parseURI, uriToString)
import Codec.Binary.UTF8.String (decodeString, encodeString)

getTitleAndUrl :: Item -> (Maybe String, Maybe String)
getTitleAndUrl item = (getItemTitle item, getItemLink item)

downloadUri :: (Maybe String,Maybe String) -> IO ()
downloadUri (Just title,Just link) = do
  item <- get link
  B.writeFile title item
    where
      get url = let uri = case parseURI url of
                      Nothing -> error $ "invalid uri" ++ url
                      Just u -> u in
                simpleHTTP (defaultGETRequest_ uri) >>= getResponseBody
downloadUri _ = print "Somewhere something went Nothing"

getTuples :: IO (Maybe [(Maybe String, Maybe String)])
getTuples = fmap (map getTitleAndUrl) <$> fmap (feedItems) <$> parseFeedString <$> decodeString <$> (simpleHTTP (getRequest "http://index.hu/24ora/rss/") >>= getResponseBody)

main :: IO ()
main = getTuples >>= print

It prints something like this:

Just [...,(Just "Gyalogosbaleset miatt \225ll a t\246megk\246zleked\233s a Margit h\237don",Just "http://velvet.hu/blogok/helyszinelo/2013/06/18/gyalogossbaleset_miatt_all_a_tomegkozlekedes_a_margit_hidon/"),...]

I did some research: the feed has its item titles surrounded by <![CDATA[ ]]>, so the XML parser skips them.

Example item:

<item>
        <title><![CDATA[Gyalogosbaleset miatt áll a tömegközlekedés a Margit hídon]]></title>
        <link>http://velvet.hu/blogok/helyszinelo/2013/06/18/gyalogossbaleset_miatt_all_a_tomegkozlekedes_a_margit_hidon/</link>
        <pubDate>Tue, 18 Jun 2013 09:08:00 +0200</pubDate>
        <category domain="main"></category>
        <description><![CDATA[A tájékoztatás szerint a budai hídfő megállójában elesett egy gyalogos, jelenleg pótlóbuszok közlekednek.]]></description>
        <content:encoded><![CDATA[A tájékoztatás szerint a budai hídfő megállójában elesett egy gyalogos, jelenleg pótlóbuszok közlekednek.]]></content:encoded>
</item>

How can I force UTF-8 encoding on this string?

pasja
  • I'm not sure I understand the question. What's the behavior here that you don't like? What does UTF-8 have to do with it? How would it behave differently if it was behaving the way you want? – shachaf Jun 18 '13 at 08:28
  • @shachaf: t\246megk\246zleked\233s -> tömegközlekedés and so on... – pasja Jun 18 '13 at 08:32
  • OK. 1: Haskell `String`s are Unicode strings. They're not UTF-8 or UTF-anything -- they're just lists of Unicode codepoints. 2: You're just looking at the result of `show` for a string. That's how the `Show` instance works -- you're not going to be able to do anything about that. If you print the string -- e.g. with `putStrLn` -- you'll see that it prints fine. The string is correct, it's just that the way you're looking at it escapes some characters. – shachaf Jun 18 '13 at 08:37
  • @shachaf: Thanks, now I understand it. I would happily accept this as an answer. – pasja Jun 18 '13 at 08:48

1 Answer

OK, I'll just copy my comment down here:

  1. Haskell `String`s are Unicode strings. They're not UTF-8 or UTF-anything -- they're just lists of Unicode codepoints.

  2. You're just looking at the result of `show` for a string. That's how the `Show` instance works -- you're not going to be able to do anything about that. If you print the string -- e.g. with `putStrLn` -- you'll see that it prints fine. The string is correct, it's just that the way you're looking at it escapes some characters.
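
To make the difference between the two points concrete, here is a minimal sketch. The names `title` and `shownTitle` are made up for illustration, and the explicit `hSetEncoding` call assumes your terminal expects UTF-8; adjust it if your locale differs.

```haskell
module Main where

import System.IO (hSetEncoding, stdout, utf8)

-- One word from the feed title in the question, written with the same
-- decimal escapes that `show` produced ("tömegközlekedés").
title :: String
title = "t\246megk\246zleked\233s"

-- What `print` (and GHCi) display: the `Show` instance wraps the string
-- in quotes and escapes every non-ASCII character.
shownTitle :: String
shownTitle = show title

main :: IO ()
main = do
  -- Assumption: the terminal is UTF-8; this makes the encoding explicit.
  hSetEncoding stdout utf8
  putStrLn shownTitle    -- "t\246megk\246zleked\233s"
  putStrLn title         -- tömegközlekedés
  print (length title)   -- 15: one element per codepoint, not per byte
```

So the value returned by `getItemTitle` is already fine. Only use `print` or `show` for debugging, and use `putStrLn` when you want the text itself.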

shachaf