
Intention: A small application to learn Haskell: download a Wikipedia article, then download all articles linked from it, then all articles linked from those, and so on... until a specified recursion depth is reached. The result is saved to a file.

Approach: Use a `StateT` to keep track of the download queue, to download an article and to update the queue. I build a list `IO [WArticle]` recursively and then print it.

Problem: While profiling, I find that the total memory in use is proportional to the number of articles downloaded.

Analysis: The literature leads me to believe this is a laziness and/or strictness issue. BangPatterns reduced the memory consumed but didn't solve the proportionality. Furthermore, I know that all articles are downloaded before the file output is started.

Possible solutions:

1) The function `getNextNode :: StateT CrawlState IO WArticle` (below) already has IO. One solution would be to just do the file writing in it and only return the state (sketched just after this list). It would mean the file is written to in very small chunks, though. Doesn't feel very Haskell...

2) Have the function `buildHelper :: CrawlState -> IO [WArticle]` (below) return `[IO WArticle]` instead. Though I wouldn't know how to rewrite that code, and I have been advised against it in the comments.
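A minimal sketch of what solution 1 could look like. It is only an illustration: it assumes an already opened Handle is passed in and that WArticle has a Show instance, neither of which is part of the original code.

import System.IO (Handle, hPutStrLn)

-- Hypothetical variant of solution 1: write each article out as soon as it
-- is downloaded, and only carry the crawl state forward.
getNextNodeAndWrite :: Handle -> StateT CrawlState IO ()
getNextNodeAndWrite h = do
  article <- getNextNode               -- reuse the existing action below
  liftIO $ hPutStrLn h (show article)  -- assumes a Show instance for WArticle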

Are any of these proposed solutions better than I think they are or are there better alternatives?

{-# LANGUAGE BangPatterns #-}

-- standard imports needed by the snippets below
import Control.Monad.State (StateT, get, put, runStateT)
import Control.Monad.IO.Class (liftIO)
import Data.List ((\\))
import Data.Text (Text, pack)

import GetArticle (WArticle, getArticle, wa_links, wiki2File) -- my own

type URL = Text

data CrawlState =
     CrawlState  ![URL]       ![(URL, Int)]
          --    [Completed]    [(Queue, depth)]
-- Called by user
buildDB :: URL -> Int -> IO [WArticle]
buildDB startURL recursionDepth = buildHelper cs
    where cs = CrawlState [] [(startURL, recursionDepth)]

-- Builds list recursively
buildHelper :: CrawlState -> IO [WArticle]
buildHelper !cs@(CrawlState _ queue) = {-# SCC "buildHelper" #-}
  if null queue
    then return []
    else do
      (!article, !cs') <- runStateT getNextNode cs
      rest <- buildHelper cs'
      return (article:rest)

-- State manipulation
getNextNode :: StateT CrawlState IO WArticle
getNextNode = {-# SCC "getNextNode" #-} do
  CrawlState !parsed !queue@( (url, depth):queueTail ) <- get
  article <- liftIO $ getArticle url
  put $ CrawlState (url:parsed) (queueTail++ ( if depth > 1
          then let  !newUrls  = wa_links article \\ parsed
                    !newUrls' = newUrls          \\ map fst queue
                    in zip newUrls' (repeat (depth-1))
          else []))
  return article

startUrl = pack "https://en.wikipedia.org/wiki/Haskell_(programming_language)"
recursionDepth = 3

main :: IO ()
main =  {-# SCC "DbMain" #-}
  buildDB startUrl recursionDepth
   >>= return . wiki2File
   >>= writeFile "savedArticles.txt"

Full code at https://gitlab.com/mattias.br/sillyWikipediaSpider. The current version is limited to downloading only the first eight links from each page to save time. With that limit unchanged, it downloads 55 pages at ~600 MB of heap usage.

Thanks for any help!

  • Your assumption here is incorrect: `IO [WArticle]` is still lazy in the evaluation of each element of the list. `IO` does not force strictness. An example of this would be executing the code `do {a <- return [undefined,1]; print (a !! 1)}`. Long story short, using `[IO WArticle]`s just makes life harder for yourself. – AJF Jun 24 '18 at 20:50
  • Not a direct answer, but [conduit](http://hackage.haskell.org/package/conduit) and similar libraries provide precisely that feature: Process large “lists” of data with more explicit resource control. – Joachim Breitner Jun 24 '18 at 23:43
  • @AJFarmar, thank you for your answer. So should the question be rephrased to "Why is laziness broken?"? Running `do {db <- buildDB url 3}` prints out trace messages for downloading all articles right away (the trace messages are printed in `getArticle`, which is not shown here). – Mattias Bermell Rudfeldt Jun 25 '18 at 06:17
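For reference, the snippet from AJF's comment can be run as a standalone action to see that IO does not force the elements of a returned list:

-- Expanded form of the snippet in AJF's comment: the undefined element is
-- never evaluated, so no error is raised.
demoLazyIO :: IO ()
demoLazyIO = do
  a <- return [undefined, 1 :: Int]
  print (a !! 1)   -- prints 1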

2 Answers

2

2) Is [IO WArticle] what I want in this case?

Not quite. The problem is that some of the IO WArticle actions depend on the result of a previous action: the links to future pages reside in previously obtained pages. [IO WArticle] can't provide that: it is pure in the sense that you can always find an action in the list without executing the previous actions.

What we need is a kind of "effectful list" that lets us extract articles one by one, progressively performing the necessary effects, but not forcing us to completely generate the list in one go.

There are several libraries that provide these kinds of "effectful lists": streaming, pipes, conduit. They define monad transformers that extend a base monad with the ability to yield intermediate values before returning a final result. Usually the final result is of a type different from the values that are yielded; it might be simply unit ().
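As a toy illustration of what such an effectful list looks like with streaming (this snippet is not from the original answer; S.yield and liftIO are the library's actual functions):

import Streaming
import qualified Streaming.Prelude as S

-- Intermediate Ints are yielded, an IO effect runs in between,
-- and a final String result is returned at the end.
toyStream :: Stream (Of Int) IO String
toyStream = do
  S.yield 1
  liftIO (putStrLn "an effect can run between yields")
  S.yield 2
  return "done"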

Note: The Functor, Applicative and Monad instances for these libraries differ from the corresponding instances for pure lists. The Functor instances map over the resulting final value, not over the intermediate values which are yielded. To map over the yielded values, they provide separate functions. And the Monad instances sequence effectful lists instead of trying all combinations; to try all combinations, they provide separate functions.
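For instance, continuing the toy example above (again a sketch, not part of the original answer; S.each and S.map come from Streaming.Prelude):

-- fmap changes the final return value of the stream,
-- while S.map changes each yielded element.
mapDemo :: Stream (Of Int) IO String
mapDemo = fmap (const "done") (S.map (*2) (S.each [1, 2, 3]))
-- yields 2, 4, 6 and then returns "done"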

Using the streaming library, we could modify buildHelper to something like this:

import Streaming
import qualified Streaming.Prelude as S

buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper !cs@(CrawlState _ queue) = 
  if null queue
    then return ()
    else do (article, cs') <- liftIO (runStateT getNextNode cs)
            S.yield article
            buildHelper cs'

And then we could use functions like mapM_ (from Streaming.Prelude, not the one from Control.Monad!) to process the articles one by one, as they are generated.
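For example, a minimal sketch of such a consumer (assuming a Show instance for WArticle, which the code in the question does not show):

-- Consume the stream one article at a time; at no point does the whole
-- list of articles need to be held in memory.
printAll :: IO ()
printAll = S.mapM_ print (buildHelper (CrawlState [] [(startUrl, recursionDepth)]))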

danidiaz
  • Cheers! I'll look into this. I must have updated the question while you were typing your answer. Sorry! I'll suggest an edit to improve readability for future readers. – Mattias Bermell Rudfeldt Jun 25 '18 at 21:16
  • `Streaming.Prelude.toHandle` seemed appropriate for writing to a file. So with `db :: Stream (Of WArticle) IO ()` I did `main = do {handle <- System.IO.openFile "outfile.txt" WriteMode; S.toHandle handle $ S.show db}`. It still consumes A LOT of RAM but never reaches above ~450 MB no matter how many articles it downloads. I guess that's because I have RAM to spare. – Mattias Bermell Rudfeldt Jun 27 '18 at 11:23
1

Adding a further explanation and code, building on danidiaz's answer. Here's the final code:

import Streaming
import qualified Streaming.Prelude as S
import System.IO (IOMode (WriteMode), hClose, openFile)

buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper cs@( CrawlState _ queue ) = 
  if null queue
    then return ()
    else do
      (article, cs') <- liftIO (runStateT getNextNode cs)
      S.yield article
      buildHelper cs'

filename :: FilePath
filename = "savedArticles.txt"  -- assumption: not defined in the answer; the question's main wrote to this file

main :: IO ()
main = do outFileHandle <- openFile filename WriteMode
          S.toHandle outFileHandle . S.show . buildHelper $
              CrawlState [] [(startUrl, recursionDepth)]
          hClose outFileHandle

outFileHandle is an ordinary file output handle.

S.toHandle takes a stream of Strings and writes them to the specified handle.

S.show maps show :: WArticle -> String over the stream.
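Their types in Streaming.Prelude are approximately:

S.toHandle :: MonadIO m => Handle -> Stream (Of String) m r -> m r
S.show     :: (Monad m, Show a) => Stream (Of a) m r -> Stream (Of String) m r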

An elegant solution that creates a lazy stream even though it is produced by a series of IO actions (namely downloading websites) and writes it to a file as results become available. On my machine it still uses a lot of memory (relative to the task) during execution but never exceeds 450 MB.

  • You can use `System.IO.withFile` instead of `openFile`/`hClose`; it frees you from having to remember to close the handle. Working with `String`—which is a linked list of characters—can be inefficient; for some uses a packed representation like `ByteString` or `Text` is better. Also, you can pass the `-M` option to the runtime to set the maximum heap size: http://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime_control.html – danidiaz Jun 27 '18 at 18:19
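A sketch of the withFile variant suggested in the comment above (not code from either answer; filename, startUrl and recursionDepth are assumed to be defined as earlier):

import Streaming
import qualified Streaming.Prelude as S
import System.IO (IOMode (WriteMode), withFile)

-- withFile opens the handle, runs the action, and closes the handle
-- even if an exception is thrown.
main :: IO ()
main = withFile filename WriteMode $ \outFileHandle ->
  S.toHandle outFileHandle . S.show . buildHelper $
    CrawlState [] [(startUrl, recursionDepth)]

-- To cap the heap as the comment suggests, compile with -rtsopts and run
-- the program with e.g. +RTS -M450m -RTS.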