I want to do a simple thing: extract from a string (that is an HTML file) some specific parts of the code.

For example:

//Get a string from a website:
$homepage = file_get_contents('http://mywebsite.org');

//Then, search for a particular substring between two strings:
echo magic_substr($homepage, "<script language", "</script>");

//where magic_substr is this function (found on this awesome website):
function magic_substr($haystack, $start, $end) {

    $index_start = strpos($haystack, $start);
    $index_start = ($index_start === false) ? 0 : $index_start + strlen($start);

    $index_end = strpos($haystack, $end, $index_start); 
    $length = ($index_end === false) ? strlen($end) : $index_end - $index_start;

    return substr($haystack, $index_start, $length);
}

The output I want is, in this case, all the scripts on a page. However, I only get the first script. I think that is expected, because the function searches once and never loops or recurses over the rest of the string. But I don't know the best way to do this! Any suggestions?
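
One way to collect every match instead of only the first is to keep searching from the point where the previous match ended. The following is a minimal sketch under that assumption; `all_between` is a hypothetical helper name (it is not part of PHP, nor of the site the original function came from), and the sample `$html` string is made up for illustration.

```php
<?php
// Hypothetical helper: returns every substring found between $start and $end.
function all_between(string $haystack, string $start, string $end): array
{
    // Empty markers would never advance the search, so bail out early.
    if ($start === '' || $end === '') {
        return [];
    }

    $matches = [];
    $offset = 0;

    // Keep searching from where the previous match ended.
    while (($from = strpos($haystack, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($haystack, $end, $from);
        if ($to === false) {
            break; // unterminated block: stop rather than guess
        }
        $matches[] = substr($haystack, $from, $to - $from);
        $offset = $to + strlen($end);
    }

    return $matches;
}

// Made-up sample input with two script blocks.
$html = '<script language="javascript">a();</script><p>x</p>'
      . '<script language="javascript">b();</script>';
print_r(all_between($html, '<script language', '</script>'));
```

Passing an offset to `strpos` avoids copying the remaining string on every iteration, and stopping on a missing `$end` keeps a truncated page from producing a bogus last entry.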

MikkoP
alessandrob
  • Puppies horribly die whenever you don't use [DOM Parser](http://php.net/manual/en/book.dom.php) to find stuff in html docs. – moonwave99 Sep 22 '12 at 12:35
  • Hi, I tried Simple HTML DOM Parser but had trouble with "max_nested_level", so I went this way :) – alessandrob Sep 22 '12 at 12:37
  • What was the problem with max_nested_level? I believe the PHP Simple HTML DOM Parser can get that done. – raygo Sep 22 '12 at 12:38
  • However, note that the DOM Parser only works when the HTML is somewhat valid. – AndreKR Sep 22 '12 at 12:40
  • With Simple HTML DOM Parser I hit the function nesting limit, which is 100, but I can't find how to change this value. I read a lot on this website about the nesting level, but I didn't find a solution, so I decided to go this way. I know it's a bit ugly :-D – alessandrob Sep 22 '12 at 12:42
  • What about the case where the code isn't valid? Thank you all for the answers! – alessandrob Sep 22 '12 at 12:43
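
For reference, the built-in DOM extension mentioned in the comments can collect every script element without any string searching. This is a rough sketch, assuming the DOM extension is enabled; `extract_scripts` is a hypothetical helper name and the sample markup is made up. `libxml_use_internal_errors` is used so that markup which is not fully valid does not flood the output with warnings.

```php
<?php
// Hypothetical helper: returns the inline code of every <script> element.
function extract_scripts(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $scripts = [];
    foreach ($doc->getElementsByTagName('script') as $node) {
        $scripts[] = $node->textContent; // empty for external scripts (src=...)
    }

    return $scripts;
}

// Made-up sample input with two inline scripts.
$html = '<html><body><script language="javascript">a();</script><p>x</p>'
      . '<script language="javascript">b();</script></body></html>';
print_r(extract_scripts($html));
```

How well this copes with badly broken markup depends on libxml's recovery, so it is worth checking the result against the real page.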

4 Answers

1

Try this to extract the data between any given tags. In your case:
extractor($homepage, "<script language", "</script>");
(The script tags did not display properly when I first posted this; pass them as you defined them in your example.)

/*****************************************************************/
/* string extractor($str,$from,$to)                              */
/* return the data between $from and $to, excluding $from and   */
/* $to themselves; return "" when either marker is missing.     */
/*****************************************************************/

function extractor($str,$from,$to)
{
    $from_pos = strpos($str,$from);
    if ($from_pos === false) {
        return ""; // $from not found
    }
    $from_pos = $from_pos + strlen($from);
    $to_pos   = strpos($str,$to,$from_pos); // $to must come after $from
    if ($to_pos === false) {
        return ""; // no closing marker after $from
    }
    return substr($str,$from_pos,$to_pos-$from_pos);
}
Sohail Ahmed
  • It's the same as "my" function :D I can only see the first string between $from and $to. In my case there should be 19 matches of this type. I know the HTML structure of the specific file I want to "parse", and I'm sure the "from" and "to" strings are always the same – alessandrob Sep 22 '12 at 12:59
  • OK, I'm posting a second answer; it will return an array of all occurrences – Sohail Ahmed Sep 22 '12 at 13:01
  • I posted it; check it now, it is at the bottom of the page – Sohail Ahmed Sep 22 '12 at 13:03
1
/****************************************************************/
/*  array getSelectiveContent($content,$from,$to,$exclude="")   */
/*  return array of content between provided                    */
/*  from and to positions.                                      */
/****************************************************************/

function getSelectiveContent($content,$from,$to,$exclude="")
{
    $return = array(); // array for return elements
    $size_FROM = strlen($from);
    $size_TO = strlen($to);
    while(true)
    {
        $pos = strpos($content,$from); // find first occurrence of $from
        if( $pos === false )
        {
            break;  // no more matches: leave the loop
        }
        $element = extractor($content,$from,$to); // fetch first element
        $trimmed = trim($element);
        // keep non-empty elements that do not contain $exclude (when one is given)
        if( $trimmed != "" && ($exclude == "" || strpos($element,$exclude) === false) )
        {
            $return[] = $trimmed; // put fetched content in array
        }
        // drop everything up to and including this $from ... $to pair
        $content = substr($content,$pos+strlen($element)+$size_FROM+$size_TO);
    }
    return $return;
}
Sohail Ahmed
0

I like the Prototype/jQuery-like way of getting elements from the DOM tree.

Try one of the jQuery-like interfaces for PHP. I haven't tried any of them in PHP myself.

EDIT:

To get valid HTML/XML first, try Tidy, HTML Purifier, or htmLawed.

Community
Anton Bessonov
0
$text="this is an example of text extract in from very long long text this is my test of the php";
$start="this";
$end="of";
$len1=strlen($start);
$len2=strlen($end);
$offset=0;
// find each $start, then the first $end that follows it
while (($pos1=strpos($text,$start,$offset))!==false){
    $pos2=strpos($text,$end,$pos1+$len1);
    if ($pos2===false){
        break; // no closing marker left
    }
    $subs=substr($text,$pos1+$len1,$pos2-($pos1+$len1));
    echo $subs.'<br/>';
    $offset=$pos2+$len2; // continue after this match
}
Afshin