Matching only words inside an HTML tag using Perl

Question

I have some HTML content that I am reading in Perl, and I want to capture only the words inside the tags, e.g.:

<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>

I want to capture every word inside the span tags.

I have tried:

([\w|’|-]+)([\W])

But it's matching the tag names as words too; see https://regex101.com/r/mD3qG4/3. Please suggest a regex that achieves this.

Thanks

[An interesting story about pony and regex](http://stackoverflow.com/a/1732454/14673) — Luc M, Jul 15 '16 at 13:15

score 3 · Accepted Answer · answered Jul 15 '16 at 13:10

Never use regexes for processing HTML unless you are absolutely forced to, and probably not even then. There are several perfectly serviceable HTML parsers on CPAN, and HTML::TreeBuilder is quite adequate for this.

Here's a program that processes your data as you requested. It finds all the span elements whose id attribute matches the regex pattern f\d{3} and stores their text contents in the array @text.

I've had to use utf8 at the top only because the text in the __DATA__ section contains some non-ASCII characters. If you're reading the HTML from an external file then there's no need for it.

use utf8;
use strict;
use warnings 'all';

use open qw/ :std :encoding(utf8) /;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file(\*DATA);

my @text = map { $_->as_text } $tree->look_down( _tag => 'span', id => qr/^f\d{3}$/ );

print "$_\n" for @text;

__DATA__
<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>

Output

From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed’st thy light’s flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world’s fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak’st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world’s due, by the grave and thee.
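
The program above collects whole lines, whereas the question asked for individual words. Here is a minimal sketch of one way to split them further. It assumes a "word" is a run of word characters, typographic apostrophes, or hyphens, which is the character class from the question without the stray pipes. The two-element @text below is only a hypothetical stand-in for the array built by the program above, so the snippet runs without HTML::TreeBuilder.

```perl
use utf8;
use strict;
use warnings 'all';

use open qw/ :std :encoding(utf8) /;

# Hypothetical stand-in for the @text array built by the program above
my @text = (
    'From fairest creatures we desire increase,',
    'Feed’st thy light’s flame with self-substantial fuel,',
);

# Assumption: a "word" is a run of word characters, ’ or -
# In list context, a /g match with no capture groups returns every match
my @words = map { /[\w’-]+/g } @text;

print "$_\n" for @words;
```

A regex is reasonable at this stage because it is applied to plain text that the parser has already extracted, not to the HTML itself. Trailing punctuation such as commas and colons falls outside the character class and is dropped.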
