
I was wondering if I could write a Haskell program to check for updates to some novels on demand, and the website I am using as an example is http://www.piaotian.net/html/7/7430/. I ran into a problem displaying its contents (on a Mac running OS X El Capitan). The simple code follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

display :: String -> IO ()
display = (>>= putStrLn) . openURL

Then, when I run display "http://www.piaotian.net/html/7/7430/" in ghci, some strange characters appear; the first lines look like this:

<title>×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´°È«ÎÄÔĶÁ_Æ®ÌìÎÄѧ</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
<meta name="keywords" content="×ß½øÐÞÏÉ,×ß½øÐÞÏÉ×îÐÂÕ½Ú,×ß½øÐÞÏÉÎÞµ¯´° Æ®ÌìÎÄѧ" />
<meta name="description" content="Æ®ÌìÎÄÑ§ÍøÌṩ×ß½øÐÞÏÉ×îÐÂÕ½ÚÃâ·ÑÔĶÁ£¬Ç뽫×ß½øÐÞÏÉÕ½ÚĿ¼¼ÓÈëÊղط½±ãÏ´ÎÔĶÁ,Æ®ÌìÎÄѧС˵ÔĶÁÍø¾¡Á¦ÔÚµÚһʱ¼ä¸üÐÂС˵×ß½øÐÞÏÉ£¬Èç·¢ÏÖδ¼°Ê±¸üУ¬ÇëÁªÏµÎÒÃÇ¡£" />
<meta name="copyright" content="×ß½øÐÞÏɰæÈ¨ÊôÓÚ×÷ÕßÎáµÀ³¤²»¹Â" />
<meta name="author" content="ÎáµÀ³¤²»¹Â" />
<link rel="stylesheet" href="/scripts/read/list.css" type="text/css" media="all" />
<script type="text/javascript">

I also tried downloading the page to a file, as follows:

import Network.HTTP

openURL :: String -> IO String
openURL = (>>= getResponseBody) . simpleHTTP . getRequest

downloading :: FilePath -> String -> IO ()
downloading fileName = (>>= writeFile fileName) . openURL

But after downloading it, the file shows the same garbled characters (the original post included a screenshot).

If I download the page with Python (using urllib, for example) the characters display normally. Also, if I write an HTML file with Chinese text myself and parse it, there seems to be no problem. Thus the problem seems to lie with the website. However, I don't see any difference between the characters on the site and the ones I write.

Any help on the reason behind this is well appreciated.

P.S.
The python code is as follows:

import urllib

theFic = file_path  # destination path for the download

urllib.urlretrieve('http://www.piaotian.net/html/7/7430/', theFic)

And the file is all fine and good.

awllower
  • Is this question inappropriate on this site? Or have I made any mistakes? If so, thanks in advance for pointing them out to me. I would like to know the reason for the quick down-votes. And sorry for the inconvenience caused. – awllower Aug 01 '16 at 14:14
  • 4
    IMO this seems to be a perfectly good question; don't know why people downvote this. It would perhaps be useful to also give the working Python example, for comparison. – leftaroundabout Aug 01 '16 at 14:17
  • FWIW, the problem also happens with the [download-curl](http://hackage.haskell.org/package/download-curl-0.1.4/docs/Network-Curl-Download.html) library instead of HTTP. It seems like Haskell just doesn't support the GBK character encoding... I suppose if you write your own html, you do it in UTF-8? – leftaroundabout Aug 01 '16 at 14:27
  • 2
    The problem also happens with just the `curl` command – pdexter Aug 01 '16 at 14:31
  • @leftaroundabout I added a minimal python code, and also added the code using `writeFile`. Thanks for the opinion. :) In addition, yes, when I wrote the html file, I used UTF-8. Thanks for the information! – awllower Aug 01 '16 at 14:35
  • 3
    The browser understands `<meta http-equiv="Content-Type" content="text/html; charset=gbk" />` so it will decode the page using the GBK character encoding. ghci's `putStrLn` doesn't understand that, obviously, so it tries to use your system encoding instead. – Reid Barton Aug 01 '16 at 14:40
  • 1
    @pdexter `curl` doesn't do any character decoding, so what you see on a terminal will depend on your terminal's locale settings, i.e. the `LANG` env variable for Unix-like systems. And, of course, your terminal will need the right fonts as well. – ErikR Aug 01 '16 at 14:41
  • @ErikR ah okay, good to know. Guess it's not unicode then. – pdexter Aug 01 '16 at 14:46
  • @awllower - what kind of system are you using? Windows, OS X, Linux? – ErikR Aug 01 '16 at 14:49
  • @ErikR I am using a Mac running El Capitan. I also added this information to the question body. Thanks for the reminder. :) – awllower Aug 01 '16 at 14:54
  • The file is encoded in GBK. I've looked, and I cannot find any Haskell libraries which can convert GBK to Unicode. You should say what else you want to do with the file besides downloading it. Count words? Extract links? Just determine if the contents have changed? – ErikR Aug 01 '16 at 15:51
  • The site seems to change the contents sometimes even though there are no updates, so I want to determine whether there are updates by checking the number of `a` links inside a `li` list, which I think is close to the number of chapters now available. Previously, when using `tagsoup` to parse the html file, I got errors about utf-8 encoding and failed to produce anything. Maybe I shall just let Python do its job and then use Haskell for the rest? But, in my humble opinion, there is not much left for Haskell to do. Thanks for the effort. :-) – awllower Aug 01 '16 at 15:57
  • @Michael Why delete your answer? I was about to say that it worked! Thanks a lot for telling me about this package! But unfortunately I failed to convert it to `String` type, but only to `[GHC.Word.Word8]` with `BL.unpack`. Still thanks though. :-) – awllower Aug 01 '16 at 16:40
  • @awllower You can use tagsoup with bytestring since GBK doesn't change the encoding for ASCII characters. – ErikR Aug 01 '16 at 16:46
  • @ErikR Ok, let me try this approach. Thanks very much! – awllower Aug 01 '16 at 17:05

5 Answers


I'm pretty sure that if you use Network.HTTP with the String type, it converts bytes to characters using your system encoding, which is, in general, wrong.

This is only one of several reasons I don't like Network.HTTP.

Your options:

  1. Use the ByteString interface. It's more awkward for some reason. It'll also require you to decode the bytes to characters yourself. Most sites give you an encoding in the response headers, but sometimes they lie. It's a giant mess, really. (A sketch of this approach follows the list.)

  2. Use a different HTTP-fetching library. I don't think any of them remove the messiness of dealing with lying encodings, but at least they don't make it awkward to avoid using the system encoding incorrectly. I'd look into wreq or http-client instead.
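
Not part of Carl's answer as posted, but for concreteness, here is a minimal sketch of option 1: it stays with Network.HTTP's ByteString interface so that no implicit byte-to-Char conversion happens, then decodes explicitly. The openURLBytes helper is mine, and text-icu is assumed for the GBK step (any GBK-capable decoder would do).

import Network.HTTP
import Network.URI (parseURI)
import qualified Data.ByteString as BS
import qualified Data.Text.IO as T
import qualified Data.Text.ICU.Convert as ICU  -- text-icu

-- fetch the raw response bytes, with no byte-to-Char guessing
openURLBytes :: String -> IO BS.ByteString
openURLBytes url =
  case parseURI url of
    Nothing  -> fail ("bad URL: " ++ url)
    Just uri -> simpleHTTP (mkRequest GET uri) >>= getResponseBody

main :: IO ()
main = do
  bytes <- openURLBytes "http://www.piaotian.net/html/7/7430/"
  conv  <- ICU.open "gbk" Nothing        -- GBK -> Unicode converter
  T.putStrLn (ICU.toUnicode conv bytes)  -- decode explicitly, then print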

Carl

If you just want to download the file, e.g. to look at it later, you just need to use the ByteString interface. It would be better to use http-client for this (or wreq if you have some lens knowledge). Then you can open the file in your browser, which will see that it is a gbk file. So far, you would just be transferring the raw bytes as a lazy ByteString; if I understand it, that's all the Python is doing. Encodings are not an issue here; the browser is handling them.

But if you want to view the characters inside ghci, for example, the main problem is that nothing will handle the gbk encoding by default the way the browser can. For that you need something like text-icu and the underlying C libraries. The program below uses the http-client library together with text-icu; these are, I think, pretty much standard for this problem, though you could use the less powerful encoding library for as much of the problem as we have seen so far. It seems to work okay:

import Network.HTTP.Client                     -- http-client
import Network.HTTP.Types.Status (statusCode)
import qualified Data.Text.Encoding as T       -- text
import qualified Data.Text.IO as T
import qualified Data.Text as T
import qualified Data.Text.ICU.Convert as ICU  -- text-icu
import qualified Data.Text.ICU as ICU
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  request <- parseRequest "http://www.piaotian.net/html/7/7430/"
  response <- httpLbs request manager
  gbk <- ICU.open "gbk" Nothing
  let txt :: T.Text
      txt = ICU.toUnicode gbk $ BL.toStrict $ responseBody response
  T.putStrLn txt

Here txt is a Text value, i.e. basically just the 'code points'. The last line, T.putStrLn txt, will use the system encoding to present the text to you. You can also handle the encoding explicitly with the functions in Data.Text.Encoding, or with the more sophisticated material in text-icu. For example, if you want to save the text in the UTF-8 encoding, you would use T.encodeUtf8.

So in my ghci the output looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>走进修仙最新章节,走进修仙无弹窗全文阅读_飘天文学</title>
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />

...

Does that look right? In my ghci, what I am seeing goes via UTF-8, since that's my system encoding, but note that the file itself says it is a gbk file, of course. If you then wanted to do some Text transformation and save the result as an html file, you would need to make sure the charset mentioned inside the file matches the encoding you use to make the bytestring you write out.
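
For instance, here is a minimal sketch of that last point. The saveAsUtf8 helper is hypothetical, and the T.replace call assumes the charset declaration appears literally as charset=gbk, as it does in this page:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- re-encode the decoded page as UTF-8 and patch the <meta> charset,
-- so the bytes on disk and the declaration inside the file agree
saveAsUtf8 :: FilePath -> T.Text -> IO ()
saveAsUtf8 path txt =
  BS.writeFile path $
    T.encodeUtf8 (T.replace (T.pack "charset=gbk") (T.pack "charset=utf-8") txt)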

You could also of course get this into the shape of a Haskell String by replacing the last three lines with

let str :: String
    str = T.unpack $ ICU.toUnicode gbk $ BL.toStrict $ responseBody response
putStrLn str
Michael
  • Yes, this looks right. But T.encodeUtf8 maps to ByteString, how should I transform that to a regular string? Or, is this possible? Thanks. :) – awllower Aug 01 '16 at 17:31
  • If you want to go from a `Text` to a `String`, you use `T.unpack`; then bytestring is not involved. – Michael Aug 01 '16 at 17:32
  • Note that if you save this String with `Prelude.writeFile` it will still look wrong in the browser, since `writeFile` will use your system encoding which seems to be `utf8`. The file itself says it is `gbk` which is how the browser will view it. – Michael Aug 01 '16 at 17:36
  • I get it! Thanks a lot! – awllower Aug 01 '16 at 17:39
  • 2
    To write a file that is `gbk` encoded using the above snippet, you would need to write `B.writeFile "this.html" $ ICU.fromUnicode gbk $ T.pack str` – Michael Aug 01 '16 at 17:39

Here is an updated answer which uses the encoding package to convert the GBK encoded contents to Unicode.

#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens --package encoding --package binary
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import qualified Data.Encoding as E
import qualified Data.Encoding.GB18030 as E
import Data.Binary.Get

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      -- decode the GB18030 bytes (a superset of GBK) into a Unicode String
      foo = runGet (E.decode E.GB18030) body
  putStrLn foo
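
As an aside, if I remember the encoding package's API correctly, the Data.Binary.Get detour can be skipped, since the package also exposes direct ByteString decoders. This variant is my assumption, not part of the answer as posted:

  -- decode the lazy response body directly, dropping the binary dependency
  foo = E.decodeLazyByteString E.GB18030 body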
ErikR

This program produces the same output as the curl command:

curl "http://www.piaotian.net/html/7/7430/"

Test with:

stack program > out.html
open out.html

(If not using stack, just install the wreq and lens packages and execute with runhaskell.)

#!/usr/bin/env stack
-- stack --resolver lts-6.0 --install-ghc runghc --package wreq --package lens --package bytestring
{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  LBS.putStr (r ^. responseBody)
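
If you would rather not rely on shell redirection, the same raw bytes can be written to a file from within the program. A minimal variation of the main above (the out.html name is just an example):

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  LBS.writeFile "out.html" (r ^. responseBody)  -- raw GBK bytes, untouched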
ErikR
  • I am installing the packages. And the curl command does not do it right. :( – awllower Aug 01 '16 at 15:17
  • 1
    I tested the curl command by saving the output to a file and using `open` to view it in a browser. Is that how you tested it? – ErikR Aug 01 '16 at 15:28
  • Oh, I tested it wrongly. Indeed the `curl` command gives the correct output. But `runhaskell` plus `wreq` and `lens` still cannot display the output correctly. It sounds like the problem is that the terminal cannot handle the gbk encoding. Also, `readFile` cannot read the file saved by the `curl` command properly. So I am wondering now: is there a way to parse the file after `curl` does the job? Or maybe we can change the encoding to utf-8 or big-5 or something? Thanks again for this. – awllower Aug 01 '16 at 15:36
  • awllower you can change from gbk bytes to Text with `text-icu` and then from Text to utf8 bytes via `Data.Text.Encoding` - or else by doing something more sophisticated with `text-icu` again. – Michael Aug 01 '16 at 16:47
  • @Michael: Sorry, I am not very familiar with working with encodings, and I don't really understand how to change from Text to utf8 bytes via `Data.Text.Encoding`: I see no particular function for doing this task in the module? – awllower Aug 01 '16 at 17:04
  • `T.encodeUtf8` turns a `Text` into (utf8 encoded) bytes; `T.decodeUtf8` turns a (utf8 encoded) bytestring into a `Text`. – Michael Aug 01 '16 at 17:09
  • 1
    I undeleted my answer for what its worth. – Michael Aug 01 '16 at 17:13

Since you said you are interested in just the links, there is no need to convert the GBK encoding to Unicode.

Here is a version which prints out all links like "123456.html" in the document:

#!/usr/bin/env stack
{- stack
  --resolver lts-6.0 --install-ghc runghc
  --package wreq --package lens
  --package tagsoup
-}

{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import qualified Data.ByteString.Lazy.Char8 as LBS
import Control.Lens
import Text.HTML.TagSoup
import Data.Char
import Control.Monad

-- match hrefs made of digits followed by ".html" (e.g. "123456.html")
isNumberHtml lbs = LBS.dropWhile isDigit lbs == ".html"

wanted t = isTagOpenName "a" t && isNumberHtml (fromAttrib "href" t)

main = do
  r <- get "http://www.piaotian.net/html/7/7430/"
  let body = r ^. responseBody :: LBS.ByteString
      tags = parseTags body
      links = filter wanted tags
      hrefs = map (fromAttrib "href") links
  forM_ hrefs LBS.putStrLn
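
Given the goal from the comments, detecting updates by counting the chapter links, one extra (hypothetical) line at the end of main would be enough:

  print (length hrefs)  -- compare this count between runs to spot new chapters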
ErikR
  • Really thanks for ending a day of turmoil. I shall get to know how this works later. After that, I will accept this answer. Thanks again! – awllower Aug 01 '16 at 17:29
  • So I guess we just don't use `putStrLn` and there is no problem then. ;P – awllower Aug 03 '16 at 05:55