Matching Multiple 'id' Values Using RegEx in Combination with HTML::TreeBuilder

Question

I've got a list of URLs in an array:

http://www.site.sx/doc1.html
http://www.site.sx/doc2.html
http://www.site.sx/doc3.html
.
.
.

Let's view the contents of the first page, namely doc1.html:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Birds</title>
   </head>

   <body>
      <p>Some birds' feathers aren't actually blue; they're clear.</p>
      <!--LOOK HERE--><p id = "abc123FACT1xyz789">There exists an insect that makes 100-decibel sounds.</p> 
   </body>
</html>

Now, let's view the contents of the second page, namely doc2.html:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Cats</title>
   </head>

   <body>
      <p>Moota goes from house to house.</p>
      <!--LOOK HERE--><p id = "abc123FACT2xyz789">Falling from a higher altitude might be better than a lower one.</p> 
   </body>
</html>

doc3.html will have the same abc123.....xyz789 type of pattern for its id value, and so will the rest of the pages in my array. I want to capture the text content of each one. Only one id value in each document follows this particular pattern. In reality there are other id values all over each document but, for the sake of simplicity, we can disregard them.

BIG PICTURE: I want to collect the text of each match, something like this:

$tree->look_down( _tag => 'p' , id => "abc123.*xyz789")->as_text; # NOT SURE HOW TO MAKE AN ARRAY OF MATCHES...

You know, `dictionary = dict(zip(TITLES, URLS))`, or something. — user3404787, Mar 11 '14 at 06:34
Give us more ids. We cannot find a pattern if you give us one example of a string without telling us what the id is. — Lodewijk Bogaards, Mar 12 '14 at 17:59
@mrhobo, what? Really? I must be terrible at explaining things... OK, try this, see my edit--in progress. — user3404787, Mar 13 '14 at 00:05

score 0 · Answer 1 · edited Mar 23 '14 at 09:28


my $match = $tree->look_down( _tag => 'p' , id => qr{abc123.*xyz789} )->as_text;

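This gets what I'm after: passing a compiled regex (qr{...}) as the attribute value makes look_down match the id against the pattern instead of comparing it as a literal string.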
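
To address the "array of matches" part of the question, here is a minimal sketch that runs the same look_down call over several documents and pushes each result onto an array. The @pages strings are hypothetical stand-ins for the fetched contents of doc1.html, doc2.html and so on (the fetching step is not shown), and the third entry is an assumed page with no matching id, included only to show the undef check.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-ins for the downloaded pages; in practice each string
# would be the content fetched from one URL in the array.
my @pages = (
    '<html><body><p>Filler.</p><p id="abc123FACT1xyz789">First fact.</p></body></html>',
    '<html><body><p>Filler.</p><p id="abc123FACT2xyz789">Second fact.</p></body></html>',
    '<html><body><p>No matching id here.</p></body></html>',
);

my @matches;
for my $html (@pages) {
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In scalar context look_down returns the first matching element,
    # or undef when nothing matches, so check before calling as_text.
    my $p = $tree->look_down( _tag => 'p', id => qr{abc123.*xyz789} );
    push @matches, $p->as_text if $p;

    $tree->delete;    # free the parsed tree before moving to the next page
}

print "$_\n" for @matches;
```

With these sample pages, @matches ends up holding the text of the two matching paragraphs, and the page without a matching id is skipped instead of dying on an undefined value.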

answered Mar 13 '14 at 00:52 by user3404787 · edited by Miller
