
Task

Replace all the spaces in the content of any tag with &nbsp;.

y.html (sample file)

<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>

What I tried

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
open (my $fh, "<", "y.html") or die $!;
my $dom = Mojo::DOM->new(do{local $/ = undef; <$fh>});
$dom->find("*")->each( sub { $_->content( $_->content =~ s/\s/\&nbsp;/gr ) } );
print $dom;

Result from above script

<p class="MsoNormal" style="margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt"><span&nbsp;lang="nl-be"&nbsp;style="font-size:10.0pt;&nbsp;font-family:symbol;color:black;mso-ansi-language:nl-be">·<span&nbsp;class="grame"><span&nbsp;style="font-s
ize:7.0pt;color:black">         <span&nbsp;style="font-size:10.0pt;font-family:arial;color:black">Kit<span&nbsp;style="font-size:10.0pt;font-family:arial;color:black"> <span&nbsp;class="spelle"><i><span&nb
sp;style="font-size:10.0pt;font-family:arial">Strongyloides<i><span&nbsp;style="font-size:10.0pt;font-family:arial"> <span&nbsp;class="spelle">ratti<span&nbsp;style="font-size:10.0pt;font-family:arial"> (n
r. 9450) van <span&nbsp;class="spelle">Bordier Affinity Products. <span&nbsp;lang="nl-be"&nbsp;style="font-size:10.0pt;font-family:&nbsp;arial;mso-ansi-language:nl-be">Zie bijsluiter in bijlage: CLKB_B_030
6. Te bewaren bij 2 – 8 °C tot vervaldatum.<span&nbsp;lang="nl-be"&nbsp;style="mso-ansi-language:&nbsp;nl-be"><o:p></o:p></span&nbsp;lang="nl-be"&nbsp;style="mso-ansi-language:&nbsp;nl-be"></span&nbsp;lang
="nl-be"&nbsp;style="font-size:10.0pt;font-family:&nbsp;arial;mso-ansi-language:nl-be"></span&nbsp;class="spelle"></span&nbsp;style="font-size:10.0pt;font-family:arial"></span&nbsp;class="spelle"></span&nb
sp;style="font-size:10.0pt;font-family:arial"></i></span&nbsp;style="font-size:10.0pt;font-family:arial"></i></span&nbsp;class="spelle"></span&nbsp;style="font-size:10.0pt;font-family:arial;color:black"></
span&nbsp;style="font-size:10.0pt;font-family:arial;color:black"></span&nbsp;style="font-size:7.0pt;color:black"></span&nbsp;class="grame"></span&nbsp;lang="nl-be"&nbsp;style="font-size:10.0pt;&nbsp;font-f
amily:symbol;color:black;mso-ansi-language:nl-be"></p>

I'm not getting the desired output: it also adds &nbsp; inside the tags themselves (e.g. </span&nbsp;). I want the replacement applied only to the content.

PS: I tried it with Mojo::DOM, but using it is not a requirement; any other parser is fine. I would still like to know what's wrong with my code.

Chankey Pathak

1 Answer


Tokenizing the input makes this job easier to work with, so I advise using HTML::TokeParser:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;

use HTML::TokeParser;

my $data = do {local $/; <DATA>};

my $p = HTML::TokeParser->new(\$data);

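while (my $token = $p->get_token) {
    if ($token->[0] eq 'T') {
        # Text token: replace literal spaces only (newlines are left alone).
        my $text = $token->[1];
        $text =~ s/ /&nbsp;/g;
        print $text;
    } else {
        # Any other token (tag, comment, declaration): the original
        # source text is its last element, so print it unchanged.
        print $token->[-1];
    }
}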

__DATA__
<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>

Outputs:

<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'>&nbsp;</span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'>&nbsp;<span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'>&nbsp;(nr.&nbsp;9450)&nbsp;van&nbsp;<span class=SpellE>Bordier</span>
Affinity&nbsp;Products.&nbsp;</span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie&nbsp;bijsluiter&nbsp;in&nbsp;bijlage:&nbsp;CLKB_B_0306.&nbsp;Te
bewaren&nbsp;bij&nbsp;2&nbsp;–&nbsp;8&nbsp;°C&nbsp;tot&nbsp;vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>
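
If the same replacement is needed on a string rather than on DATA (for example the contents of y.html, or one branch of a document), the loop above could be wrapped in a function. This is only a sketch: the name nbsp_text is hypothetical, and skipping text flagged as data (the bodies of script and style elements) is an extra assumption that the original loop does not make.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns a copy of
# $html in which literal spaces inside text tokens become &nbsp;.
sub nbsp_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";
    my $out = '';
    while (my $token = $p->get_token) {
        if ($token->[0] eq 'T') {
            # Text tokens have the form ["T", $text, $is_data].
            my ($text, $is_data) = @{$token}[1, 2];
            # Assumption: script/style bodies (flagged as data) stay as-is.
            $text =~ s/ /&nbsp;/g unless $is_data;
            $out .= $text;
        }
        else {
            # Every other token type carries its source text last.
            $out .= $token->[-1];
        }
    }
    return $out;
}

print nbsp_text(q{<p class=x style='a: b'>one two <b>three four</b></p>}), "\n";
```

Spaces inside the start tag (between attributes, and inside the style value) are left alone, because the tag arrives as a single non-text token whose source text is copied through unchanged.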
Miller
  • Thanks, that works correctly. Can't it be solved using Mojo::DOM? – Chankey Pathak Jul 02 '14 at 07:04
  • you know how much I appreciate `Mojo::DOM`, but I don't think that's the right tool for the job. However, if you were wanting to replace only a section of the DOM tree, I'd probably recommend using Mojo::DOM for traversing and finding the correct branch, and then using `HTML::TokeParser` to replace the spaces in that branch using `->content` like you originally tried. – Miller Jul 02 '14 at 07:27
  • I know, same here :). I also knew you will be the first one to answer this question ;) you are very active! Yes that's also a way to do it. Anyway I used your approach using `HTML::TokeParser` and it did the job. Have a nice day! – Chankey Pathak Jul 02 '14 at 07:35
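
On the Mojo::DOM question raised in the comments: the original attempt most likely fails because `content` on an element returns that element's inner markup, child tags included, so the substitution also hits the whitespace inside every nested start tag. A possible Mojo::DOM-only variant is to visit text nodes instead of elements. The sketch below assumes a recent Mojolicious (where `descendant_nodes` exists and `type` reports the node type); the helper name nbsp_text_nodes is hypothetical. Mojo::DOM escapes `&` when it serializes text, so the code stores the U+00A0 character and turns it into the `&nbsp;` entity only after serializing, which would also affect any U+00A0 that happens to sit in an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical helper: replace spaces in text nodes only, then serialize.
sub nbsp_text_nodes {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    $dom->descendant_nodes
        ->grep(sub { $_->type eq 'text' })    # skip tags, comments, raw data
        ->each(sub {
            my $text = $_->content;
            $text =~ s/ /\x{A0}/g;            # U+00A0 is what &nbsp; stands for
            $_->content($text);
        });

    my $out = $dom->to_string;
    # Assumption: no attribute value contains U+00A0 that must stay literal.
    $out =~ s/\x{A0}/&nbsp;/g;
    return $out;
}

print nbsp_text_nodes('<p class="a b">one two <i>three four</i></p>'), "\n";
```

Unlike the HTML::TokeParser approach, this re-serializes the whole document, so Mojo::DOM's normalization (quoted attribute values, for instance) still applies to the tags. Entities that were already in the input are decoded on parsing, so existing `&nbsp;` runs come back out as `&nbsp;` only because of the final substitution.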