2

I have a string (just for testing) and I want to replace all instances of <p> under the div <div id="text">. How do I do that?

I tried the m and s modifiers, but in vain (only the first one gets replaced). My Perl code is below:

#!/usr/bin/perl
use strict;
use warnings;

my $string = <<STRING;
<div id="main">
    hellohello
    <div id="text">
        nokay.
        <p>This is p1, SHUD B replaced</p>
        Alright
        <p>This is p2, SHUD B replaced</p>
        Yes 2
        <p>this is P3, SHUD B replaced</p>
        Okay done
        bye
    </div>
    bye
    <p>this is not under the div whose id is text and SHUDN'T b replaced</p>
</div>

STRING

my $str_bak = $string;
print "String is : \n$string\n\n";

$string =~ s/(<div id="text">.*?)<p>(.*)(<\/p>.*?<\/div>)/$1<p style="text-align:left;">$2 $3/sig;

print "String now is : \n$string\n\n";
M-D
  • Never parse XML/HTML/CSV files using regex. Use the existing modules, they are usually mature, stable and well tested. – dgw May 28 '12 at 11:25

4 Answers

2

Using XML::XSH2:

open :F html 1.html ;
for //div[@id="text"]/p
    set @style "text-align:left;" ;
save :b ;
choroba
0

Try this

(?is)<p>.+?</p>(?=.*?</div>)

Code

$subject =~ s!(?is)<p>.+?</p>(?=.*?</div>)!!g;

Explanation

(?is)        # Options for the rest of the regex: case insensitive (i); dot matches newline (s)
<p>          # Match the characters "<p>" literally
.+?          # Match any single character, one or more times, as few times as possible (lazy)
</p>         # Match the characters "</p>" literally
(?=          # Positive lookahead: assert that the regex below can be matched starting at this position
   .*?       #   Match any single character, zero or more times, as few times as possible (lazy)
   </div>    #   Match the characters "</div>" literally
)

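The annotated pattern can also be written directly in Perl with the /x modifier, so that the comments sit next to the code. This is only a sketch: the helper name strip_p_before_div and the sample string are made up for illustration. The lookahead only requires that some </div> occurs later in the string, so it does not tie the match to one particular div.

```perl
use strict;
use warnings;

# Hypothetical helper: removes every <p>...</p> that is followed, anywhere
# later in the string, by a closing </div>.
sub strip_p_before_div {
    my ($html) = @_;
    my $pattern = qr{
        <p> .+? </p>        # a <p> element, matched lazily
        (?= .*? </div> )    # lookahead: a </div> occurs somewhere after it
    }xis;
    $html =~ s/$pattern//g;
    return $html;
}

my $sample = '<div id="text"><p>a</p>x</div><p>b</p>';
print strip_p_before_div($sample), "\n";
```
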
UPDATE

Change your code as follows:

#!/usr/bin/perl
use strict;
use warnings;

my $string = <<STRING;
<div id="main">
    hellohello
    <div id="text">
        nokay.
        <p>This is p1, SHUD B replaced</p>
        Alright
        <p>This is p2, SHUD B replaced</p>
        Yes 2
        <p>this is P3, SHUD B replaced</p>
        Okay done
        bye
    </div>
    bye
    <p>this is not under the div whose id is text and SHUDN'T b replaced</p>
</div>

STRING

my $str_bak = $string;
print "String is : \n$string\n\n";

$string =~ s!(?is)<p>.+?</p>(?=.*?</div>)!!g;

print "String now is : \n$string\n\n";

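This script does what the pattern is built for: it prints the string with the matched <p> elements removed. Note that the lookahead only checks that some </div> follows, so in this sample the last <p>, which sits before the closing </div> of the outer div, is removed as well.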

Cylian
  • That doesn't do the trick. Any other ideas ? My code is a working example, so you can copy it for testing. – M-D May 28 '12 at 08:30
  • Change ```` to ```` in your code and test it [here](http://regexr.com?3134q). – Cylian May 28 '12 at 08:35
  • If you add `
    ` to it, nothing matches.
    – M-D May 28 '12 at 08:45
  • Provide a link to test with the required data on [RegExr](http://gskinner.com/RegExr/?). The example uses a ``positive lookahead``, and ``PCRE`` has some limitations on ``lookbehind``. Also see my update. – Cylian May 28 '12 at 09:20
0

First, I should say that I used the trick explained in the post "Passing a regex substitution as a variable in Perl?"

#!/usr/bin/perl
use strict;
use warnings;

my $string = <<STRING;
<div id="main">
    hellohello
    <div id="text">
        nokay.
        <p>This is p1, SHUD B replaced</p>
        Alright
        <p>This is p2, SHUD B replaced</p>
        Yes 2
        <p>this is P3, SHUD B replaced</p>
        Okay done
        bye
    </div>
    bye
    <p>this is not under the div whose id is text and SHUDN'T b replaced</p>
</div>

STRING

my $str_bak = $string;
print "String is : \n$string\n\n";

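sub modify
{
  # Run $code on a copy of $text (visible to $code as $_[0]) and return the copy.
  my($text, $code) = @_;
  $code->($text);
  return $text;
}

my $new_text = modify($string, sub {
    # Isolate the <div id="text">...</div> block.
    my $div = '(<div id="text">.*?</div>)';
    $_[0] =~ m#$div#is or return;
    my $found = $1;
    print "found : \n$found\n\n";
    # Apply the <p> substitution to the isolated block only.
    my $repl = modify ($found, sub {
         $_[0] =~ s/<p>/<p style="text-align:left;">/g
    }) ;
    # \Q...\E makes the block match literally instead of as a regex.
    $_[0] =~ s/\Q$found\E/$repl/
});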

print "Result : \n$new_text\n\n";

The trick is to use the modify sub to allow higher-order treatment of the text. We can then isolate the <div id="text">...</div> block and apply the <p> substitution to it alone.
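
The same isolate-then-substitute idea can be sketched more compactly with the /e modifier, which evaluates the replacement as Perl code. This is an illustrative alternative, not the code from the answer: the sub name style_p_in_div is hypothetical, and it assumes the target div contains no nested <div>, because the lazy .*? stops at the first </div>.

```perl
use strict;
use warnings;

# Hypothetical helper: add a style attribute to each bare <p> inside
# <div id="text">...</div>, leaving <p> tags elsewhere untouched.
# Assumes no <div> is nested inside the target div.
# The replacement part copies the isolated div, edits only that copy,
# and returns the copy as the replacement value.
sub style_p_in_div {
    my ($html) = @_;
    $html =~ s{(<div id="text">.*?</div>)}{
        my $block = $1;
        $block =~ s/<p>/<p style="text-align:left;">/gi;
        $block;
    }seig;
    return $html;
}

my $html = '<div id="text"><p>one</p><p>two</p></div><p>outside</p>';
print style_p_in_div($html), "\n";
```
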

Nicocube
0

Thank you all for the help.

I couldn't find a single regex for that, so I used a workaround. Here is how:

while( $val =~ s/(<div id="article">.*?)<p>/$1<p style="text-align:left;">/sig )
{  }

The substitution only replaces the first match each time it runs, so it is repeated in an empty while loop; the loop exits when there is nothing left to replace.
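
Here is a sketch of that loop wrapped in a sub, for illustration. One thing differs from the code above by assumption: instead of a plain .*?, the prefix uses (?:(?!</div>).)*? so that it cannot run past the first </div>. With a plain .*? and the s modifier, the loop would keep going once the div's own <p> tags are done and reach a <p> that comes after the div. The sub name style_p_loop and the $id parameter are hypothetical, and nested divs are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: style every bare <p> inside <div id="$id">...</div>.
sub style_p_loop {
    my ($html, $id) = @_;
    # Each pass rewrites the first remaining bare <p> inside the div;
    # the loop ends when the substitution no longer matches.
    1 while $html =~ s{
        ( <div \s+ id="\Q$id\E"> (?: (?! </div> ) . )*? )   # div start up to the next bare p tag
        <p>
    }{$1<p style="text-align:left;">}xsi;
    return $html;
}

my $val = '<div id="article">a<p>1</p>b<p>2</p></div><p>out</p>';
print style_p_loop($val, 'article'), "\n";
```
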

M-D