<div class="apple">

    <a href="..." > ... </a>

    <div class="boy">
        (some content here)
    </div>

    <div class="cat">
        <b>Text One.</b> <br> <i>Text Two.</i>
    </div>

    <div class="dog">
        <b>Text One.</b> <br> <i>Text Two.</i>
    </div>

</div>

.
. (there are a couple more structures with the cat class inside, but not necessarily under the apple class)
.

<div class="zoo">
.
    <div class="cat">
        <b>Text One.</b> <br> <i>Text Two.</i>
    </div>
.
</div>
.
.
.

I am working with PHP. I want to know how to select exactly "Text One." from only the div class="cat" under div class="apple" in the HTML (and not from any other div).

Currently I am doing something like this:

$html=file_get_contents('xxx.html');

$a=preg_match_all("/\<div class\=\"apple\"(.*)\<div class\=\"cat\"\>(.*)<\/b\>/s",$html,$b);

foreach ($b[1] as $value) {
    echo strip_tags("$value");
}

I found this online; it may work, but it is probably not the best way to deal with this situation.

It also selects a lot of irrelevant content (I get everything up to the last tag, which is more content than I want).

Please suggest an appropriate regular expression or a better way to solve this.

Gary Chan
  • You want an HTML parser, not a regex. – melwil Aug 21 '15 at 11:18
  • possible duplicate of [How to extract img src, title and alt from html using php?](http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php) – Kalle Aug 21 '15 at 11:43

1 Answer

Since you asked for a better way, I would suggest the Simple HTML DOM library, http://simplehtmldom.sourceforge.net.

In your example you would use it like this:

<?php

include 'simple_html_dom.php';

$html = str_get_html('<div class="apple">

    <a href="..." > ... </a>

    <div class="boy">
        (some content here)
    </div>

    <div class="cat">
        <b>Text One.</b> <br> <i>Text Two.</i>
    </div>

    <div class="dog">
        <b>Text One.</b> <br> <i>Text Two.</i>
    </div>

</div>

.
. (and there are couple more <div class="apple"> structure with cat class inside)
.

<div class="apple">
.
.
.
</div>
.
.
.');

// Scope the selector to div.apple so that div.cat elements elsewhere are not matched.
$text = $html->find('div.apple div.cat b',0)->innertext;

print $text . PHP_EOL;

// it will print this
// Text One.
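
If adding a third-party library is not an option, the same idea can be sketched with PHP's built-in DOMDocument and DOMXPath classes. This is only a minimal sketch under some assumptions: the function name firstCatTextUnderApple is hypothetical, the sample markup is a shortened version of the HTML in the question (with the other texts changed so the result is unambiguous), and the XPath matches class attributes that are exactly "apple" and "cat", so elements with several classes would need a contains()-based test instead.

```php
<?php
// Hypothetical helper (not part of any library): returns the text of the first
// <b> inside a div.cat nested in a div.apple, or null when nothing matches.
function firstCatTextUnderApple(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML often triggers parser warnings; collect them silently
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Assumption: the class attributes are exactly "apple" and "cat".
    $nodes = $xpath->query('//div[@class="apple"]//div[@class="cat"]/b');

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Shortened sample markup based on the question.
$html = <<<'HTML'
<div class="apple">
    <div class="cat"><b>Text One.</b> <br> <i>Text Two.</i></div>
    <div class="dog"><b>Dog text.</b></div>
</div>
<div class="zoo">
    <div class="cat"><b>Zoo text.</b></div>
</div>
HTML;

echo firstCatTextUnderApple($html), PHP_EOL;
```

Because the XPath starts at div.apple, the div.cat inside div.zoo is never considered, which is the restriction the question asks for.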
Alex Andrei