Extracting the body text of an HTML document using PHP

Question

I know it's better to use DOM for this purpose but let's try to extract the text in this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you very much to all of you. There was just a bug in my brain. After reading your answers, I now understand what the problem is. It should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);

I read your comment. The problem is that you have new lines: when counting the characters, \n (or \r\n) is also counted, so your offsets are incorrect, and that's why you are reading more text at the end. I will update my answer with a better explanation. – ludesign, Feb 06 '11 at 02:15

ludesign · Accepted Answer · 2011-02-06T02:55:49.933

The problem is that your string has new lines, while . in a pattern does not match newline characters by default. You need to add the s modifier to make . match across lines.

Here is my solution, I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

    // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
    if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
        $body = $matches[1];
    }
    // outputting all matches for debugging purposes
    var_dump($matches);
?>
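
To illustrate the point about . and the s modifier, here is a minimal sketch. The $sample string is a hypothetical input, not the asker's data; the two patterns differ only in the s modifier.

```php
<?php
// Hypothetical multi-line sample, not taken from the question.
$sample = "<body>\nline one\nline two\n</body>";

// Without the s modifier, "." does not match "\n", so the pattern cannot
// bridge the line breaks between <body> and </body>.
$withoutS = preg_match('/<body>(.*)<\/body>/', $sample, $m1);

// With the s modifier, "." also matches newline characters.
$withS = preg_match('/<body>(.*)<\/body>/s', $sample, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```

The capture in $m2[1] keeps the line breaks around the content, so trim() may be useful afterwards.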

Edit: I am updating my answer to give you a better explanation of why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems to be fine with it, but there is a non-printable line break at the end of every line except the last. That gives 55 printable characters and 6 line breaks. The offsets below assume Windows-style \r\n line endings, so each line break counts as 2 characters (a lone \n would count as 1).

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51) but this counts the new lines.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It is evaluated to 31 (new lines included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It is evaluated to 51 - 25 + 6 = 32 (characters you have to read) but you only have 16 printable characters of text between <body> and </body> and 4 non printable characters (new line after <body> and new line before </body>). And here is the problem, you have to group the calculation (prioritize) like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

evaluated to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).

:) Hope this helps you understand why grouping the calculation is important. (Sorry for misleading you about newlines; that point only applies to the regex example I gave above.)
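
The arithmetic above can be checked with a short script. This is only a sketch: it assumes \r\n line endings (that assumption is what yields the offsets 25 and 51), and the names $tag, $start, $end, $wrongLength and $rightLength are illustrative, not from the original code.

```php
<?php
// Assumption: the document uses \r\n line endings (2 bytes per line break).
$html = implode("\r\n", [
    '<html>', '<head>', '</head>', '<body>', '<p>Some text</p>', '</body>', '</html>',
]);

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
$tag   = $matches[0][0];            // "<body>"
$start = $matches[0][1];            // 25
$end   = strpos($html, '</body>');  // 51

// The original formula adds the tag length instead of subtracting it.
$wrongLength = $end - $start + strlen($tag);    // 51 - 25 + 6 = 32
// Grouping the start offset and tag length gives the real distance.
$rightLength = $end - ($start + strlen($tag));  // 51 - (25 + 6) = 20

$body = substr($html, $start + strlen($tag), $rightLength);
echo trim($body); // <p>Some text</p>
```

With \n line endings the individual offsets change, but the corrected length still equals the number of bytes between the end of the opening tag and the start of </body>.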

@ludesign: New lines have absolutely nothing to do with the problem at hand. – netcoder, Feb 06 '11 at 02:32
@netcoder: Yes, I apologize for my initial answer, as new lines have nothing to do with the actual issue. I updated my answer. :| Thank you. – ludesign, Feb 06 '11 at 02:43
@netcoder: I am sorry but this is not the case. I don't know if you can see the exact time of the post, but I left a comment and then started writing my explanation. It took me about 10 minutes to write it down and about two minutes to format it. Meanwhile you've updated your answer. I am not here for this reputation thing, I am here to help people solve their problems in my spare time. If you care too much about the reputation I will gladly give you the credit. Next time, if I see you've posted an answer, I won't even bother to write; this way you can be sure I am not stealing your answers. – ludesign, Feb 06 '11 at 03:30
@ludesign: I don't mind the reputation either. If you have not done anything wrong, you shouldn't bother writing a novel about it, I know I wouldn't. I don't mind you answering the same questions as I do, I only mind when you say the exact same thing as others long after they already answered it. Use the upvote button instead of writing a duplicate answer. – netcoder, Feb 06 '11 at 04:17
@netcoder: I am writing in a foreign language (English is not my native language) and I have to re-read my answers. When I start to write an answer there is usually no answer yet, and when I finish writing there are already a few answers posted. I do write "novels" just to make things clear. I am not going to invent another syntax for pcre just to make it look different than what others have written. Thank you for your critique, but I don't feel guilty for being a slow writer. From now on I will refresh the page before clicking the post button. No bad feelings :) – ludesign, Feb 06 '11 at 04:25

score 4 · Answer 2 · answered Feb 06 '11 at 02:07


Personally, I wouldn't use regex.

<?php

$html = <<<EOD

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>foobar</h1>
    </body>
</html>

EOD;

$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';

echo trim(substr($html, $s, strpos($html, $f) - $s));

?>

returns <h1>foobar</h1>

answered Feb 06 '11 at 02:07

jhine


Note: Only works when <body> has no attributes (e.g. class, style, events) – Mavelo May 28 '18 at 16:31
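
If the opening tag may carry attributes, one non-regex option is to locate the end of the opening tag separately. A minimal sketch, not a robust parser: extractBody is a hypothetical helper name, the input string is made up for illustration, and the code assumes no attribute value contains a > character.

```php
<?php
// Hypothetical helper: returns the trimmed body content, or null if no
// body element is found. Assumes no ">" appears inside an attribute value.
function extractBody(string $html): ?string
{
    $open = stripos($html, '<body');         // start of the opening tag
    if ($open === false) {
        return null;
    }
    $tagEnd = strpos($html, '>', $open);     // end of the opening tag
    $close  = stripos($html, '</body>');     // start of the closing tag
    if ($tagEnd === false || $close === false || $close < $tagEnd) {
        return null;
    }
    $start = $tagEnd + 1;                    // first byte after ">"
    return trim(substr($html, $start, $close - $start));
}

// Made-up input with attributes on the body tag.
$result = extractBody('<html><body class="main" id="top"><h1>foobar</h1></body></html>');
echo $result; // <h1>foobar</h1>
```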

netcoder · Answer 3 · 2011-02-06T02:21:03.870

2

The problem is in the length you pass to substr. You should subtract both the start offset and the tag length:

$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are doing:

+ strlen($matched_body_start_tag)

That said, it seems a little overkill considering you can do it using preg_match only. You just need to make sure you match across new lines, using the s modifier:

preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];

Outputs:

<p>Some text</p>
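
If the input might not contain a body element at all, it may be worth checking the return value of preg_match before reading $matches[1]. A small sketch, using a hypothetical $html input with an attribute on the body tag:

```php
<?php
// Hypothetical input; the class attribute is only there to exercise [^>]*.
$html = "<html>\n<body class=\"page\">\n<p>Some text</p>\n</body>\n</html>";

// preg_match returns 1 on a match, 0 on no match, false on error.
if (preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches) === 1) {
    // The capture includes the line breaks around the content, so trim it.
    $body = trim($matches[1]);
} else {
    $body = null;
}

var_dump($body); // string(16) "<p>Some text</p>"
```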

edited Feb 06 '11 at 02:21

answered Feb 06 '11 at 01:59

netcoder


Thank you very much for your answer. Your solution is neat and simple but I really need to know what goes wrong with the codes. – bobo Feb 06 '11 at 02:06
I really thought it has something to do with newlines. Thank you very much for pointing out the problem. – bobo Feb 06 '11 at 02:55

score 1 · Answer 4 · 2011-02-06T05:39:53.627

Somebody has probably already found your error; I didn't read all the replies.
The algebra is wrong.

code is here

Btw, first time seeing ideone.com, that's pretty cool.

$body = substr( 
          $html, 
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
        );

or ..

$body = substr(
          $html,
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
       );
