
I am completely new to Haskell and trying to learn. To get going, I decided to write a short (unbalanced) binary search tree program. It breaks a text into words, adds the words to the binary tree (discarding repetitions), and then traverses the tree in order, printing the sorted list of words in the text.

data BinTree t = ExternalNode
               | InternalNode (BinTree t) t (BinTree t)

treeInsert :: Ord t => BinTree t -> t -> BinTree t
treeInsert ExternalNode                     w = InternalNode ExternalNode w ExternalNode
treeInsert tree@(InternalNode left v right) w
  | w == v    = tree
  | w < v     = InternalNode (treeInsert left w) v right
  | otherwise = InternalNode left v (treeInsert right w)

treeFromList :: Ord t => [t] -> BinTree t
treeFromList l = go ExternalNode l
  where
    go acc []       = acc
    go acc (x : xs) = acc `seq` go (treeInsert acc x) xs

inOrderList :: BinTree t -> [t]
inOrderList ExternalNode                = []
inOrderList (InternalNode left v right) = (inOrderList left) ++ [ v ] ++ (inOrderList right)

main :: IO ()
main = do
  tmp <- readFile "words.txt"
  printList . inOrderList . treeFromList $ words tmp

  where
    printList []       = return ()
    printList (x : xs) = do
      putStrLn x
      printList xs

The program works fine on small texts. Then I fed the King James Bible to it. It crashes complaining that the stack size is too small. I have to increase the stack size to 200M to make it work!

Where is my mistake? I imagine it could have something to do with lazy evaluation messing up stuff. In any case, the problem is not with the depth of the binary search tree, which is only 163 for the Bible example.

Fernando
  • "only" 163? That's 2^164-1 elements, if the tree is balanced. – Bartek Banachewicz Dec 17 '14 at 14:01
  • Can you provide a link to the file `words.txt` that you used? Is [this version of the KJV](http://www.gutenberg.org/cache/epub/10/pg10.txt) comparable to yours? – ErikR Dec 17 '14 at 15:42
  • Also - are you running it compiled or from ghci? – ErikR Dec 17 '14 at 15:43
  • Thanks for the input, guys. I am using exactly the KJV version user5402 linked to. I said the height of the tree is about 163 (I computed it). This doesn't mean that the tree has 2^163 elements (it is not full); it has far fewer. It is just a bit tall because it is not a balanced search tree. FYI, the tree has 34057 words in it. And no, I am not running from ghci. I am running from the command line (the full program is there). – Fernando Dec 17 '14 at 16:02
  • Using ghc 7.8.3 I am able to run the program with only 27M of stack: `bintree +RTS -K27M` (after compiling with `ghc -O2 bintree.hs -rtsopts`) – ErikR Dec 17 '14 at 16:19
  • @user5402: True! My binary search for stack size was: 20M fails, so try 200M. :) But still, isn't 27M a bit much? If each node of the tree (internal or external, never mind) has a word of <= 10 bytes and two pointers to the left and right subtrees, we would get a total tree size of ~1.8M. Why does the stack grow so much? If the depth of the tree is only 163, in a language like C the recursion depth for traversal would not exceed 163. Is it not the same in Haskell? How can it be avoided? – Fernando Dec 17 '14 at 16:28
  • Well, `String` uses about a dozen bytes per character. And if any branches of your tree are unevaluated due to laziness, that's probably half a dozen bytes or so... – MathematicalOrchid Dec 17 '14 at 17:09
  • Indeed, `String` is not usually what you want if you care about performance: use `Text`. If you absolutely only ever need ASCII, you can even use `ByteString`, but that's frowned upon. – dfeuer Dec 17 '14 at 18:05
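
The height-versus-size point raised in the comments above can be checked directly. The following is a minimal sketch: `treeSize` and `treeHeight` are hypothetical helper names that do not appear in the original program, and the height convention (empty tree = 0, single node = 1) is an assumption, since the question does not say which convention was used.

```haskell
data BinTree t = ExternalNode
               | InternalNode (BinTree t) t (BinTree t)

-- Hypothetical helper: number of values stored (one per internal node).
treeSize :: BinTree t -> Int
treeSize ExternalNode         = 0
treeSize (InternalNode l _ r) = treeSize l + 1 + treeSize r

-- Hypothetical helper: height, assuming an empty tree has height 0
-- and a single node has height 1.
treeHeight :: BinTree t -> Int
treeHeight ExternalNode         = 0
treeHeight (InternalNode l _ r) = 1 + max (treeHeight l) (treeHeight r)
```

Under this convention a tree of height h holds at most 2^h - 1 values (when it is full) and as few as h (when it degenerates into a chain), so a height of 163 says little about the number of words stored.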

1 Answer


The problem is that you are building up too deeply nested thunks.
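
To see where the thunks come from, here is a small sketch using the `treeInsert` from the question unchanged. It suggests that an insert into a subtree is not performed when `treeInsert` returns; it is stored in the new node as a pending call. `rootValue` and `lazyDemo` are hypothetical names introduced only for this illustration.

```haskell
data BinTree t = ExternalNode
               | InternalNode (BinTree t) t (BinTree t)

-- The lazy insert from the question, unchanged.
treeInsert :: Ord t => BinTree t -> t -> BinTree t
treeInsert ExternalNode                     w = InternalNode ExternalNode w ExternalNode
treeInsert tree@(InternalNode left v right) w
  | w == v    = tree
  | w < v     = InternalNode (treeInsert left w) v right
  | otherwise = InternalNode left v (treeInsert right w)

-- Hypothetical helper: reads the root value without touching the subtrees.
rootValue :: BinTree t -> Maybe t
rootValue ExternalNode         = Nothing
rootValue (InternalNode _ v _) = Just v

-- The left subtree is `undefined`, so evaluating it would raise an error.
-- The insert still returns a node, because the recursive call
-- `treeInsert undefined 3` is stored as a thunk and never run here.
lazyDemo :: Maybe Int
lazyDemo = rootValue (treeInsert (InternalNode undefined 5 ExternalNode) 3)
-- lazyDemo evaluates to Just 5
```

Each later insert that descends the same way wraps one more pending call around the previous one. As far as I can tell, the chain is only unwound when `inOrderList` finally demands that subtree, and unwinding it needs stack proportional to the number of pending inserts rather than to the height of the tree. With a `treeInsert` that forces the recursive result before building the node, evaluating the same expression would raise an error instead.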

The following version adds `seq` calls in `treeInsert` to force evaluation at each level of the tree (and makes the node's value field strict), and it runs in very little stack:

import System.Environment
import Control.Monad

data BinTree t = ExternalNode
               | InternalNode (BinTree t) !t (BinTree t)

treeInsert :: Ord t => BinTree t -> t -> BinTree t
treeInsert ExternalNode                     w = InternalNode ExternalNode w ExternalNode
treeInsert tree@(InternalNode left v right) w
  | w == v    = tree
  | w < v     = let t = treeInsert left w  in t `seq` InternalNode t v right
  | otherwise = let t = treeInsert right w in t `seq` InternalNode left v t

treeFromList :: Ord t => [t] -> BinTree t
treeFromList l = go ExternalNode l
  where
    go acc []       = acc
    go acc (x : xs) = let t = treeInsert acc x in t `seq` go t xs

inOrderList :: BinTree t -> [t]
inOrderList ExternalNode                = []
inOrderList (InternalNode left v right) = (inOrderList left) ++ [ v ] ++ (inOrderList right)

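-- Reads the file named by the first command-line argument and prints
-- its distinct words in sorted order.
main1 :: IO ()
main1 = do
  (arg0:_) <- getArgs
  tmp <- readFile arg0
  let t = treeFromList $ words tmp
  forM_ (inOrderList t) putStrLn

main :: IO ()
main = main1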

You can also use strictness annotations in the definition of `BinTree`:

data BinTree t = ExternalNode | InternalNode !(BinTree t) !t !(BinTree t)

in lieu of the `seq` calls in `treeInsert`; this is what `Data.Set` does.
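
Put together, the strict-field variant might look like the sketch below. The `treeInsert` body is the one from the question with no `seq` calls. Using `foldl'` from `Data.List` for `treeFromList` is my substitution for the hand-written `go` loop, not something the answer prescribes; it forces the accumulator to weak head normal form at each step.

```haskell
import Data.List (foldl')

-- All three fields are strict, so building a node forces both subtrees
-- and the value to weak head normal form.
data BinTree t = ExternalNode
               | InternalNode !(BinTree t) !t !(BinTree t)

-- Same shape as the question's insert; the strict fields do the forcing.
treeInsert :: Ord t => BinTree t -> t -> BinTree t
treeInsert ExternalNode                     w = InternalNode ExternalNode w ExternalNode
treeInsert tree@(InternalNode left v right) w
  | w == v    = tree
  | w < v     = InternalNode (treeInsert left w) v right
  | otherwise = InternalNode left v (treeInsert right w)

-- Assumption: foldl' stands in for the explicit accumulator loop.
treeFromList :: Ord t => [t] -> BinTree t
treeFromList = foldl' treeInsert ExternalNode

inOrderList :: BinTree t -> [t]
inOrderList ExternalNode                = []
inOrderList (InternalNode left v right) = inOrderList left ++ [v] ++ inOrderList right
```

For example, `inOrderList (treeFromList (words "b a c a"))` yields the distinct words in sorted order.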

It appears that the `seq` call in `treeFromList` doesn't have much effect.

ErikR
  • `BangPatterns` is a GHC extension for using such things in patterns. Using them in `data` declarations is Haskell 98. – dfeuer Dec 17 '14 at 17:59
  • That said, the bangs in `data` declarations are, as far as I know, and with the exception of explicitly unpacked fields, mostly a convenience. You can instead use a "smart constructor" that forces the appropriate arguments, leaving open the possibility of installing thunks in those fields in some particular situation (which I'm considering doing in `Data.IntMap` to support a `fromFunction` function). – dfeuer Dec 17 '14 at 18:02
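
The "smart constructor" idea from the last comment could be sketched roughly as follows. The name `node` is hypothetical, and `foldl'` is my choice for driving the inserts; neither comes from the thread. The data type keeps lazy fields, and `node` forces its arguments before building the constructor.

```haskell
import Data.List (foldl')

-- Fields are left lazy; strictness is supplied by the smart constructor.
data BinTree t = ExternalNode
               | InternalNode (BinTree t) t (BinTree t)

-- Hypothetical smart constructor: forces both subtrees and the value to
-- weak head normal form before building the node.
node :: BinTree t -> t -> BinTree t -> BinTree t
node l v r = l `seq` v `seq` r `seq` InternalNode l v r

treeInsert :: Ord t => BinTree t -> t -> BinTree t
treeInsert ExternalNode w = node ExternalNode w ExternalNode
treeInsert tree@(InternalNode left v right) w =
  case compare w v of
    EQ -> tree
    LT -> node (treeInsert left w) v right
    GT -> node left v (treeInsert right w)

-- Assumption: foldl' forces each intermediate tree, which in turn runs `node`.
treeFromList :: Ord t => [t] -> BinTree t
treeFromList = foldl' treeInsert ExternalNode

inOrderList :: BinTree t -> [t]
inOrderList ExternalNode                = []
inOrderList (InternalNode left v right) = inOrderList left ++ [v] ++ inOrderList right
```

The trade-off is that code which calls `InternalNode` directly can still store a thunk in a field, which is the flexibility the comment describes; code that should be strict has to remember to go through `node`.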