
The XML::Twig documentation for the set_text method comes with a warning:

set_text ($string) Set the text for the element: if the element is a PCDATA, just set its text, otherwise cut all the children of the element and create a single PCDATA child for it, which holds the text.

So suppose I want to do something simple, such as changing the case of all the text in my XML document:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $twig = XML::Twig->new(
    'pretty_print'  => 'indented_a',
    'twig_handlers' => {
        '_all_' => sub {
            my $newtext = $_->text_only;
            $newtext =~ tr/a-z/A-Z/;
            $_->set_text($newtext);
        }
    }
);
$twig->parse( \*DATA );
$twig->print;

__DATA__
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>

This - because of set_text replacing children - gets clobbered into:

<root></root>

But if I focus on just one (bottom level) node (e.g. a_subnode) then it works fine.

Is there an elegant way to replace or transform the text within an element without clobbering the data structure below it? I could test for the presence of children or something similar, but it seems like there should be a better way of doing this. (A different library, maybe?)

(For the sake of clarity: this is just my example of transliterating all the text in a document. My actual use case is rather more convoluted, but it is still about in-place text transformation.)

I'm considering a cut-and-paste approach (cut all the children, replace the text, paste all the children back), but that seems inefficient.

Sobrique

2 Answers


Instead of putting the handler on _all_, try putting it only on text elements (#TEXT), and change text_only to text. It should work.

Update: alternatively, drop the handler and use the char_handler option when you create the twig: char_handler => sub { uc shift }.
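
As a minimal sketch of the char_handler suggestion, the script below feeds the question's sample document through a twig whose only customisation is a char_handler that upper-cases each string it receives. The `$xml` and `$result` variable names are illustrative assumptions, and the handler may be called several times per element, once for each chunk of text the parser reports.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Sample document from the question, held in a string (illustrative name).
my $xml = <<'XML';
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>
XML

my $twig = XML::Twig->new(
    'pretty_print' => 'indented_a',

    # Called with each chunk of character data; the return value
    # is what ends up stored in the twig.
    'char_handler' => sub {
        my ($text) = @_;
        return uc $text;
    },
);
$twig->parse($xml);

# Serialise once so the result can be inspected as well as printed.
my $result = $twig->sprint;
print $result;
```

Because only character data passes through the handler, the element names and the nesting of a_subnode inside some_content should be left alone. Any string transformation, including a regex substitution, could replace the uc call, as long as it is applied to one chunk at a time.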

mirod
  • It's pretty close (probably close enough for my needs). But it doesn't quite give the desired output, as it doesn't adjust the top level `some_content` node. (presumably due to the presence of children?) – Sobrique Jun 12 '15 at 13:05
  • Drats! I missed this. And it looks like a bug. The handler is not called when the text is followed by an open tag. I'll check this we. – mirod Jun 12 '15 at 13:31
  • Possibly. I'm not sure, because there's some oddities with `text` vs. `text_only`. It also seems to reformat in a way that isn't the indent format specified. (But I can live with that too). – Sobrique Jun 12 '15 at 14:18
  • Duh! there's a simpler way: use `char_handler` (edited above). – mirod Jun 12 '15 at 14:28
  • I take it that only works per character though? I'm (ideally) looking for a more general case that lets me regex transform. – Sobrique Jun 12 '15 at 14:30
  • No, it works on the entire string, as sent by the parser. That means it will also be called for non-significant whitespace, and if the content of an element includes a line return it will be called 3 times: once for the first line, once for the line return itself and once for the second line. – mirod Jun 12 '15 at 16:34

My current approach is to:

  • iterate all the nodes.
  • cut all the children.
  • amend the text.
  • paste all the children.

This seems inefficient, but it does appear to work:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

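# Upper-case an element's own text while keeping its child elements.
sub replace_text {
    my ( $twig, $element ) = @_;

    # text_only returns the element's own text, without its descendants' text.
    my $newtext = $element->text_only;

    # Detach every non-text child so that set_text cannot discard it.
    my @children;
    foreach my $child ( $element->children ) {
        if ( $child->tag ne '#PCDATA' ) {
            push( @children, $child->cut );
        }
    }
    $newtext =~ tr/a-z/A-Z/;
    $element->set_text($newtext);

    # Re-attach the children, in their original order, after the new text.
    $_->paste( 'last_child', $element ) for @children;
}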

my $twig =
    XML::Twig->new( 'twig_handlers' => { '_all_' => \&replace_text, } );
$twig->parse( \*DATA );

print "Result:\n";
$twig->print;

__DATA__
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>

This turns my output into:

<root><some_content>FISH
        <a_subnode>MOREFISH</a_subnode></some_content><some_more_content>CABBAGE</some_more_content></root>

So whilst it does transform the text of the nodes, it also, for some reason, breaks the output format.

Reparsing the result with pretty_print set:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( $twig->sprint )->print;

seems to do the trick, although parsing twice just to reformat seems even less elegant:

<root>
  <some_content>FISH
        <a_subnode>MOREFISH</a_subnode></some_content>
  <some_more_content>CABBAGE</some_more_content>
</root>
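
A possible alternative that avoids cutting and pasting is sketched below. It assumes it is acceptable to parse the whole document first and then visit only the text nodes. Since set_text on a PCDATA element just sets its text, no children are touched. The `$xml`, `$text_node` and `$result` names are illustrative.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Sample document from the question, held in a string (illustrative name).
my $xml = <<'XML';
<root>
    <some_content>fish
        <a_subnode>morefish</a_subnode>
    </some_content>
    <some_more_content>cabbage</some_more_content>
</root>
XML

my $twig = XML::Twig->new( 'pretty_print' => 'indented_a' );
$twig->parse($xml);

# Walk the text nodes only; each one is a #PCDATA element, so
# set_text replaces its string without removing anything else.
foreach my $text_node ( $twig->root->descendants('#PCDATA') ) {
    my $newtext = $text_node->text;
    $newtext =~ tr/a-z/A-Z/;
    $text_node->set_text($newtext);
}

my $result = $twig->sprint;
print $result;
```

The trade-off is that the whole tree has to be in memory before the transformation runs, so this is less suitable than a handler for very large documents.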
Sobrique