4

The xml-conduit documentation only lists examples where the entire XML tree is consumed by a ConduitM, for example:

<people>
    <person age="25">Michael</person>
    <person age="2">Eliezer</person>
</people>

I'm trying to parse a tree where besides the <person> tags from above, there are deeply nested subtrees I'm not interested in (their exact schema might even be unknown), for example:

<people>
    <person age="25">Michael</person>
    <tagImNotInterestedIn><!-- deeply nested complex subtree --></tagImNotInterestedIn>
    <person age="2">Eliezer</person>
</people>

When parsing with the people.hs example from the docs, I get the following exception:

people.hs: XmlException {xmlErrorMessage = "Expected end tag for: Name {nameLocalName = \"people\", nameNamespace = Nothing, namePrefix = Nothing}", xmlBadInput = Just (EventBeginElement (Name {nameLocalName = "tagImNotInterestedIn", nameNamespace = Nothing, namePrefix = Nothing}) [])}

Basically, I'm looking for a way to ignore any tag (including all its children and attributes) except specific ones I specify parsers for. When using DOM-based parsers like HXT, this is obviously easy, but the tag docs explicitly states that it will fail unless all children are consumed.

The only hypothetical way I can think of accomplishing this is to use functions from Control.Exception to build up a Conduit with a Maybe a result (returning Nothing on exception) and then use orE to combine it with the parsers itself

Although it has been stated that the xml-conduit API needs some updating, I think there has to be a less-hackish way to ignore an entire subtree. Any ideas will be appreciated!

Community
  • 1
  • 1
Uli Köhler
  • 13,012
  • 16
  • 70
  • 120
  • 2
    With a streaming parser you have to parse everything until you reach the close event of the the tag enclosing the current subtree. When you leave the inner parser of a `tag`, it expects to immediately parse the closing tag. Sadly the is no function for that in xml-conduit. One way to do it is to write a custom sink, count the opening tags (decrements on closing) and trash everything in between until the counter reach zero. – Piezoid Jun 21 '14 at 03:20
  • @Piezoid I have submitted a pull request that adds functions to perform this task. By using recursion, I didn't have to use a counter, even if that would be a possibility. I'll post an answer after it has been merged. Meanwhile, thanks for you support! – Uli Köhler Aug 02 '14 at 14:11
  • @UliKöhler did that PR ever land? Can you link to it? – Flip Apr 24 '17 at 19:32
  • 1
    @Flip Original PR https://github.com/snoyberg/xml/pull/40 but I didnt have time to finish it so pavelkogan stepped in and created https://github.com/snoyberg/xml/pull/58 which was merged – Uli Köhler Apr 25 '17 at 02:11

1 Answers1

1

Since 1.5.0 Text.XML.Stream.Parse provides a function takeTree, which probably could be used for the purpose.

{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad                (void)
import           Control.Monad.Trans.Class    (lift)
import           Control.Monad.Trans.Resource (MonadThrow, runResourceT)
import           Data.ByteString.Lazy         (ByteString)
import           Data.ByteString.Lazy.Char8   (concat)
import           Data.Conduit                 (ConduitT, runConduit, (.|))
import           Data.Conduit.List            (mapM_)
import           Data.Text                    (Text, unpack)
import           Data.XML.Types               (Event)
import           Prelude                      hiding (concat, mapM_)
import           Text.XML.Stream.Parse        (choose, content, def,
                                               ignoreAnyTreeContent,
                                               ignoreAttrs, manyYield, many_,
                                               parseLBS, requireAttr, tag',
                                               tagNoAttr, takeTree)

data Person = Person Int Text deriving Show

parsePerson :: MonadThrow m => ConduitT Event o m (Maybe Person)
parsePerson = tag' "person" (requireAttr "age") $ \age -> do
    name <- content
    return $ Person (read $ unpack age) name

parsePeople :: MonadThrow m => ConduitT Event Person m ()
parsePeople = void $ tagNoAttr "people" $
  many_ (choose([takeTree "person" ignoreAttrs, ignoreAnyTreeContent])) .| manyYield parsePerson

persons :: ByteString
persons = concat [
    "<people>"
  , "<foo/>"
  , "<person age=\"25\">Michael</person>"
  , "<bar/>"
  , "<person age=\"2\">Eliezer</person>"
  , "<tagImNotInterestedIn>x</tagImNotInterestedIn>"
  , "</people>"

main :: IO ()
main = runResourceT $
  runConduit $ parseLBS def persons .| parsePeople .| mapM_ (lift . print)

The code above is based on xml-conduit sample. Only parsePeople is changed.

λ> main
Person 25 "Michael"
Person 2 "Eliezer"
palik
  • 2,425
  • 23
  • 31