Extracting Text in body that is not part of tag with HTML::TreeBuilder

Question

I have some ugly HTML that is emailed to my program. It looks like this:

<html>
    <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
    </head>
    <body>
        Saved search results.<br>
    <br>
    Name: 'Some splunk search' <br>
    Query Terms: 'tag=foo NOT BAR=\&quot;Boom\&quot;' <br>
    Link to results: <a href="https://foo/search/blahblahblah">
    https://foo/search/blahblahblah</a>
    <br>
    <br>
    <table border="1">

...snipped the rest for brevity.

I am able to pull the table elements out using HTML::TreeBuilder, but I can't figure out how to pull the "Name:" and "Query Terms:" values out of the text above without resorting to other means.

A $root->dump of the above looks like:

<html> @0
  <head> @0.0
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" /> @0.0.0
  <body> @0.1
  <p> @0.1.0 (IMPLICIT)
     " Saved search results. "
     <br /> @0.1.0.1
     <br /> @0.1.0.2
     " Name: 'Some splunk search' "
     <br /> @0.1.0.4
     " Query Terms: 'tag=foo NOT BAR=\"Boom\""

So is there a way to get the naked text between the <br /> elements at @0.1.0.2 and @0.1.0.4?

Thanks! Todd

score 0 · Answer 1 · answered Feb 08 '13 at 19:48

If there is a pattern to the text, it might be easier to use a combination of HTML parsing and regular expressions.

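# $root is the parsed HTML::TreeBuilder tree from the question
my $body = $root->find_by_tag_name('body');
my $body_text = $body->as_text(skip_dels => 1);

# Capture whatever sits between the single quotes after each label
my ($name) = ($body_text =~ m#Name: '([^']+)'#s);
my ($query_terms) = ($body_text =~ m#Query Terms: '([^']+)'#s);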

Pasha Sadri

Yeah... That's what I've done currently, but it doesn't feel right. Seems like there should be a way to pluck those lines out. The dump even skips a number for them. (0.1.0.2, text, then 0.1.0.4) Thanks for the reply though... – Todd Feb 08 '13 at 20:43
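
The skipped number in the dump is a hint: text nodes take up a slot in their parent's content list, but they are stored as plain strings rather than element objects. A minimal sketch of plucking them out with content_list follows. It assumes the sample input is trimmed from the question, that the loose text sits either directly under <body> or under an implicit <p> (which of the two depends on the parser settings), and that every wanted line has the shape Label: 'value'. The %field hash is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input trimmed from the question; the real message continues with a table.
my $html = <<'HTML';
<html>
    <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
    </head>
    <body>
        Saved search results.<br>
    <br>
    Name: 'Some splunk search' <br>
    Query Terms: 'tag=foo NOT BAR=\&quot;Boom\&quot;' <br>
    </body>
</html>
HTML

my $root = HTML::TreeBuilder->new_from_content($html);
my $body = $root->find_by_tag_name('body');

# content_list returns child elements as objects and text nodes as plain
# strings, so "!ref" keeps only the naked text.
my @text_nodes;
for my $node ($body->content_list) {
    if (!ref $node) {
        push @text_nodes, $node;
    }
    elsif ($node->tag eq 'p' && $node->implicit) {
        # Assumption: the parser wrapped the loose text in an implicit <p>,
        # as in the dump from the question.
        push @text_nodes, grep { !ref } $node->content_list;
    }
}

# Hypothetical helper hash: label => value for lines shaped like Label: 'value'.
my %field;
for my $text (grep { /\S/ } @text_nodes) {
    if ($text =~ /^\s*([^:]+):\s*'(.*)'\s*$/s) {
        $field{$1} = $2;
    }
}

print "$_ => $field{$_}\n" for sort keys %field;

$root->delete;    # free the tree explicitly
```

This still uses a small regex to split each label from its value, but the parser does the work of isolating each text node, so the pattern only ever sees one line at a time.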
