Symfony2 DomCrawler and FB2 book format parser

Question

Hi all,

How do I correctly parse the XML file described below with the Symfony2 DomCrawler component?

I need to split the document into sections and, for each section, collect only the inner tags (epigraph, p, poem, etc.) that belong directly to that section.

I have a book in the standard FB2 XML format, shown below:

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
<description></description>
<body>
<section>
    <title><p><strong>Level 1, section 1</strong></p></title>
    <section>
        <title><p><strong>Level 2, section 2</strong></p></title>
        <section>
            <title><p><strong>Level 3, section 3</strong></p></title>
            <p>Level 3, section 3, paragraph 1</p>
            <poem>
                <stanza>
                    <v>bla-bla-bla 1</v>
                    <v>bla-bla-bla 2</v>
                    <v>bla-bla-bla 3</v>
                </stanza>
            </poem>
            <p>Level3, section 3, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 4</strong></p></title>
            <p>Level 3, section 4, paragraph 1</p>
            <p>Level 3, section 4, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 5</strong></p></title>
            <p>Level 3, section 5, paragraph 1</p>
            <p>Level 3, section 5, paragraph 2</p>
            <p>Level 3, section 5, paragraph 3</p>
            <empty-line/>
            <subtitle>This file was created</subtitle>
            <subtitle>with BookDesigner program</subtitle>
            <subtitle>bookdesigner@the-ebook.org</subtitle>
            <subtitle>22.04.2004</subtitle>
        </section>
    </section>
</section>
</body>
</FictionBook>

The code below does not work, so could somebody help me solve this? By the way, the title is parsed correctly, but the section's inner tags are not.

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        $c = $node->filter('section')->reduce(function(Crawler $node, $i) {
            return ($i == 0);
        });

        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $c->html(),
        );
    });

    echo "*******************************************\n";

    foreach($sections as $section ) {
        echo ">>> ".$section['title']."\n";
        echo "!!! ".$section['inner']."\n";
    }
}

Thanks for the help!

Can you use the built-in serializer/deserializer with your XML instead of the DomCrawler? [Look here](http://symfony.com/doc/current/components/serializer.html) — Sehael, Nov 19 '13 at 20:53

score 1 · Accepted Answer · answered Nov 20 '13 at 15:12

After four days, I found a solution using XPath:

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $node->filterXPath("//*[not(section)]")->html(),
        );
    });

    foreach($sections as $section) {
        echo "TITLE: ".$section['title']."\n";
        echo "INNER: ".$section['inner']."\n";
    }
}
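
Since Crawler::html() returns the markup of the first matched node only, it may help to compare the result with a stricter variant that walks the direct children of each section and skips the title and any nested sections. The following is only a minimal sketch under stated assumptions: it uses PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than DomCrawler, it assumes PHP 7 or later, and the names FB2_NS and collectSections() as well as the prefix fb are made up for illustration.

```php
<?php
// Sketch only: plain DOM extension, not DomCrawler.
// FB2_NS and collectSections() are hypothetical names used for illustration.
const FB2_NS = 'http://www.gribuser.ru/xml/fictionbook/2.0';

function collectSections(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // The FB2 default namespace has to be bound to a prefix before XPath can match it.
    $xpath->registerNamespace('fb', FB2_NS);

    $sections = [];
    foreach ($xpath->query('//fb:section') as $section) {
        // Direct children only, skipping the title and any nested sections.
        $children = $xpath->query('./*[not(self::fb:section) and not(self::fb:title)]', $section);

        $inner = '';
        foreach ($children as $child) {
            $inner .= $doc->saveXML($child);
        }

        $sections[] = [
            'title' => trim($xpath->evaluate('string(./fb:title)', $section)),
            'inner' => $inner,
        ];
    }

    return $sections;
}

// Usage with a reduced two-level document.
$xml = <<<'XML'
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
<body>
<section>
    <title><p>Level 1</p></title>
    <section>
        <title><p>Level 2</p></title>
        <p>Only paragraph</p>
    </section>
</section>
</body>
</FictionBook>
XML;

foreach (collectSections($xml) as $section) {
    echo "TITLE: ".$section['title']."\n";
    echo "INNER: ".$section['inner']."\n";
}
```

The self:: axis tests the element itself, whereas not(section) tests whether an element has a section child, so the two predicates select different node sets. With this variant, a section that only wraps other sections ends up with an empty inner string.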

score -1 · Answer 2 · answered Nov 18 '13 at 12:57

If you reduce your XML file quite a bit you get something like this:

<section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
</section>

You want to catch the child section elements, not the parent one.

Currently you are iterating only over the list of parent section elements, which means you only get the HTML of the parent section element.

To iterate over the children, you need to select section section instead of section.

A side note to further improve your code: instead of the ugly reduce call, just use ->first() to get the first element of the node list.

Altogether, your code becomes:

$sections = $crawler->filter('section section')->each(function(Crawler $node) {
    $c = $node->filter('section')->first();

    return array(
        'title' => $node->filter('title')->text(),
        'inner' => $c->html(),
    );
});
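
To check which elements each selector reaches, the matches can be counted with plain DOMXPath. This is a rough sketch, not DomCrawler code: it assumes PHP 7 or later with the built-in DOM extension, uses the XPath //fb:section//fb:section as a stand-in for the CSS selector section section, and the function name countSectionMatches() and the prefix fb are hypothetical.

```php
<?php
// Sketch only: countSectionMatches() is a hypothetical helper, not a DomCrawler API.
function countSectionMatches(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Bind the FB2 default namespace to the illustrative prefix "fb".
    $xpath->registerNamespace('fb', 'http://www.gribuser.ru/xml/fictionbook/2.0');

    return [
        // Every section at any depth.
        'section' => $xpath->query('//fb:section')->length,
        // Only sections that have a section ancestor.
        'section section' => $xpath->query('//fb:section//fb:section')->length,
    ];
}

$xml = <<<'XML'
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
<body>
<section>
    <section><p>a</p></section>
    <section><p>b</p></section>
</section>
</body>
</FictionBook>
XML;

print_r(countSectionMatches($xml));
```

The descendant form never matches a top-level section, so that section's own title and direct content are not visited when iterating over section section.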

Thanks, Wouter J., but your solution simply drops the parent section's own data. Moreover, with three or more levels in the input document, $c->html() returns ALL the child sections belonging to this root. — Alexander Vasilenko, Nov 19 '13 at 11:45
I think that for my case DomCrawler would need a method that does the opposite of filter('section') (something like a NOT filter() or discard(...)), returning everything except the nested section tags. — Alexander Vasilenko, Nov 19 '13 at 12:08
