Reconstruct a graph from BFS output in Haskell

Question

I want to reconstruct the incidence structure of a graph in Haskell, which is given by the output of a breadth first traversal of it. Explicitly, the output consists of a root vertex and a list of neighborhoods (a neighborhood is a list of vertices marked as new or old (= already visited)), where each neighborhood corresponds to the least vertex which has not been assigned to a neighborhood, yet.

In any imperative language, I would solve the problem by using a queue:

Input: root vertex r, list of neighborhoods L
(1) Put r into the empty queue Q
(2) if Q is empty then STOP
(3) extract the first vertex v of Q
(4) extract the first neighborhood N of L
(5) append the unvisited vertices of N to Q
(6) remove the markings (new/old) of the nodes of N and assign v to N
(7) goto (2)

I tried to implement this naive algorithm in Haskell (by using a list or by using Data.Sequence as queue), but ghci always runs out of memory. This should not happen, because although the input consists of 300MB data, 16GB RAM should clearly suffice.

Therefore the naive implementation seems to cause a memory leak. How would you implement this algorithm in Haskell?

Edit: Here are the (slightly simplified) data types, I use:

data Output = Out !Vertex ![[BFSNode]]
data Vertex = Vertex Integer SomeMoreComplexData
data BFSNode = New Vertex | Old Integer

data Graph = ![Vertex] ![(Integer,[Integer])]

The data type "Output" contains the already parsed BFS output consisting of the root vertex and the lists of neighborhoods. BFSNode corresponds to a node in the BFS tree which belongs to either a new vertex which is visited for the first time, or to an old vertex which already has been visited and which is therefore referred by its unique number. Note that the parsing process works fine and consumes very few memory.

My aim is to convert "Output" into the data type "Graph" which consists of the lists of vertices and of an incidence list.

Here is a simplified version of my implementation:

readTree :: [[BFSNode]] -> Seq Integer -> Graph
readTree [] _ = Graph [] []
readTree (nb:nbs) qs =
    let (i :< qs') = viewl qs
        newVs = fromList $! map nodeNr . filter isNew $ nb
        (Graph vs adj) = readTree nbs $ qs' >< newVs
    in  Graph (map unNew (filter isNew nb) ++ vs) ((i,nub $ map nodeNr nb):adj)

"nbs" is the list of neighborhoods, "qs" is the queue. The function "nodeNr" extracts the unique identification number from a vertex, "isNew" tests whether a vertex is new, and "unNew" unpacks a new vertex from the data type "BFSNode".

Edit2: I think I localized the problem now. Maybe it has nothing to do with my implementation of the conversion process. My failure was to use the build in function "read" to read the data type "Output" from a file. I realized now that Haskell has problems with reading big files. Even if it were just about reading a list of integers, e.g.

main = do 
    txt <- readFile "test"
    writeFile "test2" . show $ (read txt :: [Integer]) }

the program will run out of memory if the file "test" is big enough. I understand now, that it is no good idea to parse data in this way, since "read" will load all data into the memory before showing any output, but I still do not understand why it fills 16GB of RAM although the file amounts not even 500MB. Do you have any idea what is wrong with "read"? Does Haskell show the same behavior on your machines?

Edit3: Now I implemented a stream based parsing function "readOutput" which takes a String and returns the data type "Output". This function is lazy, so I immediately get an output when I call it. But when I compose it with my conversion function "readTree" (which is clearly tail-recursive) I get no output at all and the memory usage increases as usual. What am I doing wrong?

Edit4: The problem in Edit3 came from some strictifications which I removed now.

Regarding the naïve implementation you already have: Try compiling it with `ghc -O2` instead of using GHCi. — gspr, Dec 04 '13 at 15:23
It's also worth adding some strictness to the leaves of the `Sequence`, eg make sure that all data is fully evaluated before going into the queue, what is the implementation of a vertex? Can it be unboxed/strictified? — daniel gratzer, Dec 04 '13 at 15:38
Is there anyway to see some subset of the test data/code? It's easier to optimize then. — daniel gratzer, Dec 04 '13 at 15:41
@gspr: Thank you for your fast response! Unfortunately, compiling with -O2 does not change anything. — Dune, Dec 05 '13 at 09:49
@jozefg: It seems to be a good idea to strictify the vertices before appending them to the queue, as the memory usage increases more slowly now, but it still runs out of memory. I will try to add more relevant informations. — Dune, Dec 05 '13 at 09:55

score 3 · Answer 1 · answered Dec 05 '13 at 06:52

This question does not specify a key ingredient - how is the graph going to be represented in Haskell? Functional programs require carefully thought out data structures to maximize sharing and run efficiently. Usually, this means they're recursively built from nothing (inductive). There's a paper on inductive graphs and functional graph algorithms‎ that gives one representation:

module Test where

data Graph a = Empty | Extension (Graph a) [Int] (Int, a)
               deriving Show

That is, a graph is either Empty, or a (smaller) graph extended by one node. This is exactly how lists are built using Cons in functional languages, except that the additional node has to specify the smaller graph, the predecessor links ([Int]), and the new node number and data, (Int,a). Note that they also implemented this as an abstract type ''for efficiency reasons.''

A graph with one node can be generated by extending the empty graph.

singleton :: (Int,a) -> Graph a
singleton x = Extension Empty [] x

Using this structure, it's simple to define a recursive parse algorithm for your BFS tree.

data Mark a = Visited Int | New (Int,a) deriving Show

parse :: (Int,a) -> [[Mark a]] -> Graph a
parse x nbrs = extend Empty [x] nbrs

extend :: Graph a -> [(Int,a)] -> [[Mark a]] -> Graph a
extend g [] [] = g
extend g _  [] = Empty -- leftover nodes, really an error.
extend g [] _  = Empty -- leftover neighborhoods, really an error.
extend g (x : tl) (nbr : nbrs) =
  extend (Extension g (seen nbr) x) (news tl nbr) nbrs

news :: [(Int,a)] -> [Mark a] -> [(Int,a)]
news l (New x : tl) = news (uniq l x) tl
news l (_ : tl) = news l tl
news l [] = l

uniq :: [(Int,a)] -> (Int,a) -> [(Int,a)]
uniq (x:tl) y = x : if (fst x == fst y) then tl else uniq tl y
uniq [] y = [y]

seen :: [Mark a] -> [Int]
seen (Visited i : tl) = i : seen tl
seen (_ : tl) = seen tl
seen [] = []

m0 = [New (1,())]
m1 = [Visited 0, New (2,()), New (3,())]
m2 = [Visited 1, New (3,())]
m3 = [Visited 1, Visited 2]    
nbrs = [m0,m1,m2,m3]

Testing it out,

$ ghci
GHCi, version 7.6.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> :load Test
[1 of 1] Compiling Test             ( Test.hs, interpreted )
Ok, modules loaded: Test.
*Test> parse (0,()) nbrs
Extension (Extension (Extension (Extension Empty [] (0,())) [0] (1,())) [1] (2,())) [1,2] (3,())

For efficiency, you could do the following:

The news and seen functions could be combined let (ns,sn) = newseen nbr ([],[]) and made tail-recursive (passing their partially constructed lists and returning immediately) for efficiency.
Your input could keep track of the node at the center of each neighbor list. This would avoid the list concatenation in the stack of neighbors. Alternatively, you could use a functional dequeue to hold that stack.

If you haven't seen it, I'd recommend Okasaki's book on purely functional data structures.

Thank you very much for this great answer! I will try to adapt your implementation for my purposes, and answer back soon. But besides that, have you any idea, where the memory leak comes from? — Dune, Dec 05 '13 at 11:20
Parsing is one of the biggest pains in Haskell. There's example code for reading 1M ints in Haskell's [vector tutorial](http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Vector_Tutorial#An_example:_filling_a_vector_from_a_file). The usual explanation for memory leaks is maintaining references to too many computed objects, which Haskell zealously caches. To get around this, most people use some form of streaming (see Blaze.Builder). A good list of best-in-class software is associated with the [warp](http://hackage.haskell.org/package/warp) server. — David M. Rogers, Dec 05 '13 at 16:14
Thank you for this hint. But even with a stream based parsing function the same problem occurs (as I explained in Edit3 above). This is really frustrating... — Dune, Dec 05 '13 at 17:02

Reconstruct a graph from BFS output in Haskell

1 Answers1