Extract XML from XML embedded in HTML

Question

I'm trying to get the XML presented here http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml but it's a bit tricky because they don't give any support for it. The purpose is to get the XML into PHP in order to go through it.

Can someone give a hint?

Are you trying to fetch the XML from the page, for example with file_get_contents()? Or can you copy and paste it yourself (just the XML code, without the rest of the page)? — Freeman, Apr 06 '13 at 19:43
I'm still a bit new to this kind of stuff, but what I want is to get the XML automatically (I'm building a web service), preferably without the HTML tags. — jose bode, Apr 06 '13 at 19:48
Strange to want to scrape from HTML when that site seems to have numerous APIs for the data. Worst case, you could use the PHP simplehtmldom library and convert the HTML tags to XML tags/attributes. Setting that up would take more time than finding the correct REST API. — charlietfl, Apr 06 '13 at 19:59
Just wanted to thank everyone for the answers, they were a good help. — jose bode, Apr 06 '13 at 22:09

hakre · Accepted Answer · 2013-04-07T14:58:39.717

XML that is presented inside an HTML page is still XML; you only need to get it back out as text.

What you're looking for is something called textContent in DOMDocument. That will give you only the text from that HTML, just as it is displayed "as text" in the browser.

So all you need to do is load the HTML document into a DOMDocument. Because the page's HTML contains errors, libxml's internal error handling is switched on while loading, so the parser warnings are not emitted:

$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';

$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);
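If you want to see what the parser complained about instead of just silencing it, the collected errors can be read back. The following is only a sketch: the `$html` string is a made-up stand-in for the downloaded page, and logging through `error_log()` is an arbitrary choice.

```php
<?php
// Hypothetical stand-in for the downloaded page: deliberately sloppy HTML.
$html = '<html><body><div id="ResultView"><p>unclosed paragraph<div>text</div></body></html>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);  // collect parse errors instead of emitting warnings
$loaded = $doc->loadHTML($html);
$errors = libxml_get_errors();                 // array of LibXMLError objects (may be empty)
libxml_clear_errors();                         // free the collected errors
libxml_use_internal_errors($previous);         // restore whatever setting was active before

foreach ($errors as $error) {
    // Each LibXMLError carries, among other things, a line number and a message.
    error_log(sprintf('line %d: %s', $error->line, trim($error->message)));
}
```

Restoring the previous setting rather than hard-coding FALSE keeps the snippet from interfering with code that had already enabled internal errors.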

The next part requires specific knowledge about the page being scraped. In your case the XML is the text content of all div tags with the class attribute "xml-tag" that are *following siblings* of the element with the id "ResultView".

These tags can easily be fetched with an XPath query; their text content is then stored in an array:

$xpath  = new DOMXPath($doc);
$nodes  = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}
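To see the query at work without hitting the network, here is a small sketch on an invented page fragment. The markup in `$html` is an assumption about how such a page could look (entity-encoded XML inside div.xml-tag elements), not a copy of the real one. The class test uses the usual contains/concat idiom so that a div carrying additional class names still matches, which the plain `@class="xml-tag"` comparison would not.

```php
<?php
// Hypothetical miniature of the page: the XML is shown as entity-encoded text
// inside div.xml-tag elements that follow the #ResultView element.
$html = <<<'HTML'
<html><body>
<div id="ResultView"></div>
<div class="xml-tag">&lt;ROOT&gt;</div>
<div class="xml-tag indent">&lt;ITEM id="1"&gt;a &amp;amp; b&lt;/ITEM&gt;</div>
<div class="other">not part of the XML</div>
<div class="xml-tag">&lt;/ROOT&gt;</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);
// Match "xml-tag" as a whole word inside the class attribute.
$query = '//*[@id="ResultView"]/following-sibling::div'
       . '[contains(concat(" ", normalize-space(@class), " "), " xml-tag ")]';

$buffer = [];
foreach ($xpath->query($query) as $node) {
    $buffer[] = $node->textContent;   // textContent returns the decoded text
}
$xmlString = implode('', $buffer);
```

Note that textContent decodes one level of entities, so the displayed `&lt;ITEM&gt;` becomes real markup while an `&amp;amp;` in the page source ends up as a valid `&amp;` in the XML.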

All that is left now is to create a new DOMDocument, load that XML buffer into it, apply some nice formatting, and output it:

$new = new DOMDocument();
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
$new->save('php://output');

These roughly 20 lines of code then produce the following output:

<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
  <EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
    <IDENTIFIERS>
      <PRIMARY_ID>ERX086768</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
    </IDENTIFIERS>
    <TITLE/>
    <STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
      <IDENTIFIERS>
        <PRIMARY_ID>ERP000913</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
      </IDENTIFIERS>
    </STUDY_REF>
    <DESIGN>
      <DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
        <IDENTIFIERS>
          <PRIMARY_ID>ERS074283</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
        </IDENTIFIERS>
      </SAMPLE_DESCRIPTOR>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>4008297</LIBRARY_NAME>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <PAIRED NOMINAL_LENGTH="250"/>
        </LIBRARY_LAYOUT>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Reverse</READ_TYPE>
            <RELATIVE_ORDER follows_read_index="0"/>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <ILLUMINA>
        <INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
      </ILLUMINA>
    </PLATFORM>
    <PROCESSING/>
  </EXPERIMENT>
  <SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
    <IDENTIFIERS>
      <PRIMARY_ID>ERA119046</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
    </IDENTIFIERS>
  </SUBMISSION>
  <STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
    <IDENTIFIERS>
      <PRIMARY_ID>ERP000913</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
    </IDENTIFIERS>
    <DESCRIPTOR>
      <STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
      <STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
      <CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
      <STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
    </DESCRIPTOR>
  </STUDY>
  <SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS074283</PRIMARY_ID>
      <SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
    </IDENTIFIERS>
    <SAMPLE_NAME>
      <COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
      <TAXON_ID>119602</TAXON_ID>
      <SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
    </SAMPLE_NAME>
    <SAMPLE_LINKS>
      <SAMPLE_LINK>
        <ENTREZ_LINK>
          <DB>biosample</DB>
          <ID>859730</ID>
        </ENTREZ_LINK>
      </SAMPLE_LINK>
    </SAMPLE_LINKS>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>Strain</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>Sample Description</TAG>
        <VALUE/>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-StrainOrLine</TAG>
        <VALUE>MR223754</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Sex</TAG>
        <VALUE>not applicable</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>ArrayExpress-Species</TAG>
        <VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
  <RUN_SET>
    <RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
      <IDENTIFIERS>
        <PRIMARY_ID>ERR109334</PRIMARY_ID>
        <SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
      </IDENTIFIERS>
      <EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
        <IDENTIFIERS>
          <PRIMARY_ID>ERX086768</PRIMARY_ID>
          <SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
        </IDENTIFIERS>
      </EXPERIMENT_REF>
      <Pool>
        <Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
      </Pool>
    </RUN>
  </RUN_SET>
</EXPERIMENT_PACKAGE>
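Since the goal was to go through the XML in PHP rather than print it, the same document can also be kept as a string and walked with SimpleXML. This is a sketch under assumptions: `$xmlString` is a shortened, hand-written stand-in for `implode('', $buffer)`, and the two values read at the end are just examples of what one might look up.

```php
<?php
// Hypothetical, shortened stand-in for implode('', $buffer).
$xmlString = '<EXPERIMENT_PACKAGE><EXPERIMENT accession="ERX086768">'
           . '<IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID></IDENTIFIERS>'
           . '</EXPERIMENT></EXPERIMENT_PACKAGE>';

$new = new DOMDocument();
$new->preserveWhiteSpace = false;
$new->formatOutput = true;
if (!$new->loadXML($xmlString)) {
    // loadXML() returns false when the text is not well-formed XML.
    throw new RuntimeException('The extracted text is not well-formed XML.');
}

$pretty  = $new->saveXML();              // formatted XML as a string instead of direct output
$package = simplexml_import_dom($new);   // SimpleXMLElement view of the same document

$accession = (string) $package->EXPERIMENT['accession'];              // attribute access
$primaryId = (string) $package->EXPERIMENT->IDENTIFIERS->PRIMARY_ID;  // child element access
```

Checking the return value of loadXML() is worth it here, because a change in the page layout would most likely show up as a buffer that no longer parses.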

So don't re-invent the wheel, just learn about the existing tools. It's sometimes easier than it looks at first sight.

Thanks a lot! I will try it. I'm sorry if this question was a repost; I did try to find out how to solve it, but I got a bit desperate... Anyway, thanks one more time :) — jose bode, Apr 07 '13 at 16:08

score 1 · Answer 2 · answered Apr 06 '13 at 19:42

http://php.net/manual/en/class.simplexmlelement.php

SimpleXMLElement gives you an easy interface for using the XML as an object. You may need to pass some options in order to handle CDATA values and attributes, I suppose. To get the XML from a web server, use something like cURL or file_get_contents(); cURL is recommended.
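A minimal sketch of that interface, assuming you already have the XML as a string. The sample `$xml` below is invented for illustration; `LIBXML_NOCDATA` is one real option that is relevant to CDATA handling.

```php
<?php
// Hypothetical XML string with an attribute and a CDATA section.
$xml = '<SAMPLE accession="ERS074283"><NOTE><![CDATA[5 < 6 & so on]]></NOTE></SAMPLE>';

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes, which mainly
// matters when the object is dumped or converted (print_r, json_encode).
$sample = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($sample === false) {
    throw new RuntimeException('Could not parse the XML.');
}

$accession = (string) $sample['accession'];   // attributes use array syntax
$note      = (string) $sample->NOTE;          // child elements use property syntax
```

The explicit `(string)` casts matter: without them you are holding SimpleXMLElement objects, not plain strings.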

radalin

The problem is I can't get the XML alone... what I get using cURL is the full HTML page :/ – jose bode Apr 06 '13 at 19:44
Oh, sorry. Maybe you could try the DOM extension (http://php.net/manual/en/book.dom.php) after using cURL or file_get_contents to get the HTML from the page. However, to beautify that XML code they have put a good amount of HTML inside it, so it will be a bit hard for you to extract that information. Maybe once you find the element that contains the XML, you could replace the b elements with something else, then use strip_tags to get rid of the HTML tags, and then reparse the former b elements. Or you can traverse the DOM too, I suppose. – radalin Apr 06 '13 at 19:56

score 1 · Answer 3 · answered Apr 06 '13 at 19:43

Clicking on send data > get takes you to another page, with options to download in different formats. This URL: http://trace.ncbi.nlm.nih.gov/Traces/sra/?cmd=dload&run_list=ERR109334&format=fasta appears to provide the data in gzip format. Perhaps you can use a GET on this source instead of trying to parse the XML out of the HTML?

Jonathan

That file is huge (13 MB and still downloading; I stopped it). The XML can't be this big – Freeman Apr 06 '13 at 19:58
Maybe that's the `FASTA` file format? I didn't look at the other options on that page, but it *appears* to be a direct download link... – Jonathan Apr 06 '13 at 20:04
Haha, sorry, those are sequence files; it would only stop at 1/2 GB (I think xD). Thanks anyway! – jose bode Apr 06 '13 at 20:06

score 1 · Answer 4 · answered Apr 06 '13 at 19:55

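You would have to make a list of all the valid HTML tags and remove them from the webpage. For example:

$taglist = ['div', 'b', 'input']; // List the HTML tags here.
$xml = file_get_contents('http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml'); // Read in the webpage here.
foreach ($taglist as $tag) {
    // Opening tag with any attributes. The pattern needs explicit delimiters (~ here);
    // a pattern that starts with '<' and ends with '>' would use those as its delimiters.
    $regex = '~<' . $tag . '\b[^>]*>~i';
    $xml = preg_replace($regex, '', $xml);

    // Repeat for the closing tag
    $regex = '~</' . $tag . '\s*>~i';
    $xml = preg_replace($regex, '', $xml);
}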

After that finishes, $xml will contain the XML as a string, and PHP should be able to handle it.
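As a sketch of the same idea with built-in functions: strip_tags() removes the real HTML tags, and html_entity_decode() turns the displayed text back into markup. This assumes the page shows its XML as entity-encoded text (`&lt;TAG&gt;`) wrapped in real HTML tags; the `$fragment` string is a made-up example of that shape, not taken from the actual page.

```php
<?php
// Hypothetical fragment of the page source: real HTML tags wrapped around
// entity-encoded XML markup.
$fragment = '<div class="xml-tag"><b>&lt;TAG&gt;</b>Strain<b>&lt;/TAG&gt;</b></div>';

$text = strip_tags($fragment);   // drops the real HTML tags, keeps the encoded text
$xml  = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');  // &lt;TAG&gt; becomes <TAG>

// If the result is well-formed, it can be treated as XML from here on.
$element = simplexml_load_string($xml);
```

On a whole page this would also keep the text of everything outside the XML area (navigation, titles), so the fragment holding the XML still has to be isolated first.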

I've tried with strip_tags, but something still seems to get past it... If I clean it all, will I be able to treat it as XML? — jose bode, Apr 06 '13 at 19:58

mohammad mohsenipur · Answer 5 · 2013-04-06T20:40:47.717

This class, XmlRead, can do it. I have included a cURL class for it too.

curl:

 function HeaderProc($response,$Run="",$String=1/*[Is 1 IF Use for String Mode ]*/){
          if($String==1){
             $response=explode("\r\n",$response);  
          }
          $PartHeader=0;
          $out[$PartHeader]=array();
          foreach($response as $key=>$val){ // each() was removed in PHP 8
              $name='';
              $value='';
              $flag=false;
              for($i=0;$i<strlen($val);$i++){
                  if($val[$i]==":"){
                      $flag=true;
                      for($j=$i+1;$j<strlen($val);$j++){
                        if($val[$j]=="\r" and isset($val[$j+1]) and $val[$j+1]=="\n"){    
                            break;
                        }
                        $value.=$val[$j];
                      }
                      break;
                  }
                  $name.=$val[$i]; 
              }
              if($flag){
                if($name=='' and $value==''){
                    $PartHeader++;  
                }else{
                  if(isset($out[$PartHeader][$name])){
                    if(is_array($out[$PartHeader][$name])){   
                        $out[$PartHeader][$name][]=$value;
                    }else{
                        $T=$out[$PartHeader][$name];
                        $out[$PartHeader][$name]=array();
                        $out[$PartHeader][$name][0]=$T;  
                        $out[$PartHeader][$name][1]=$value;  
                    }
                  }else{
                    $out[$PartHeader][$name]=$value;
                  }
                }
              }else{
                if($name==''){
                    $PartHeader++;  
                }else{
                    if(isset($out[$PartHeader][$name])){ 
                      if(is_array($out[$PartHeader][$name])){   
                        $out[$PartHeader][$name][]=$value;
                      }else{
                        $T=$out[$PartHeader][$name];
                        $out[$PartHeader][$name]=array();
                        $out[$PartHeader][$name][0]=$T;  
                        $out[$PartHeader][$name][1]=$name;  
                      }
                    }else{
                        $out[$PartHeader][$name]=$name; 
                    }
                } 
              }
              if($Run!=""){
                $Run($name,$value);  
              }
          }
          return $out;
}

class cURL { 
    var $headers; 
    var $user_agent; 
    var $compression; 
    var $cookie_file; 
    var $proxy; 
    var $Cookie; 
    var $cookies; // assigned in the constructor; declared to avoid a dynamic property
    function CookieAnalysis($Cookie){//convert str cookie to array cookie 
       //echo $Cookie;
       $this->Cookie=array();
       preg_match("~(.*?)=(.*?);~si",' '.$Cookie.'; ',$M);
       $this->Cookie[trim($M[1])]=trim($M[2]);
       return $this->Cookie;
    }
    // PHP 8 no longer treats a method named after the class as the constructor.
    function __construct($cookies=false,$cookie='cookies.txt',$compression='gzip',$proxy='') {
         $this->headers[] = 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
         $this->headers[] = 'Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3'; 
         $this->headers[] = 'Accept-Encoding:gzip,deflate,sdch';
         $this->headers[] = 'Accept-Language:en-US,en;q=0.8';
         $this->headers[] = 'Cache-Control:max-age=0';
         $this->headers[] = 'Connection:keep-alive';
         $this->user_agent = 'User-Agent:Mozilla/5.0 (SepidarSoft [Organic Search Engine Crawler] Linux Edition) AppleWebKit/536.5 (KHTML, like Gecko) SepidarBrowser/1.0.100.52 Safari/536.5';
         $this->compression=$compression; 
         $this->proxy=$proxy; 
         $this->cookies=$cookies; 
         if ($this->cookies == TRUE) $this->cookie($cookie); 
    } 
    function cookie($cookie_file) { 
         if (file_exists($cookie_file)) { 
            $this->cookie_file=$cookie_file; 
         } else { 
            $fp = fopen($cookie_file,'w') or $this->error('The cookie file could not be opened. Make sure this directory has the correct permissions');
            $this->cookie_file=$cookie_file; 
            fclose($fp); // close the handle returned by fopen(), not the file name
         } 
    }
    function GET($url) { 
         $process = curl_init($url); 
         curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers); 
         curl_setopt($process, CURLOPT_HEADER, 1); 
         curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent); 
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
         curl_setopt($process,CURLOPT_ENCODING , $this->compression); 
         curl_setopt($process, CURLOPT_TIMEOUT, 30); 
         if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy); 
         curl_setopt($process, CURLOPT_RETURNTRANSFER, 1); 
         curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1); 
         $response = curl_exec($process);
         $header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
         $result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
         foreach($result['Header'] as $HeaderK=>$HeaderP){
           if(!isset($HeaderP['Set-Cookie']) || !is_array($HeaderP['Set-Cookie']))continue;
           foreach($HeaderP['Set-Cookie'] as $key=>$val){
             $result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
           }
         }
         $result['Body'] = substr( $response, $header_size );
         $result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
         $result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL); 
         curl_close($process); 
         return $result; 
    }
    function POST($url,$data) { 
         $process = curl_init($url); 
         curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers); 
         curl_setopt($process, CURLOPT_HEADER, 1); 
         curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent); 
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
         if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
         curl_setopt($process, CURLOPT_ENCODING , $this->compression); 
         curl_setopt($process, CURLOPT_TIMEOUT, 30); 
         if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy); 
         curl_setopt($process, CURLOPT_POSTFIELDS, $data); 
         curl_setopt($process, CURLOPT_RETURNTRANSFER, 1); 
         curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1); 
         curl_setopt($process, CURLOPT_POST, 1);
         $response = curl_exec($process); 
         $header_size = curl_getinfo($process,CURLINFO_HEADER_SIZE);
         $result['Header'] = HeaderProc(substr($response, 0, $header_size),'',1);
         foreach($result['Header'] as $HeaderK=>$HeaderP){
            if(!isset($HeaderP['Set-Cookie']) || !is_array($HeaderP['Set-Cookie']))continue;
           foreach($HeaderP['Set-Cookie'] as $key=>$val){
             $result['Header'][$HeaderK]['Set-Cookie'][$key]=$this->CookieAnalysis($val);
           }
         }
         $result['Body'] = substr( $response, $header_size );
         $result['HTTP_State'] = curl_getinfo($process,CURLINFO_HTTP_CODE);
         $result['URL'] = curl_getinfo($process,CURLINFO_EFFECTIVE_URL);
         curl_close($process); 
         return $result; 
    }
    function error($error) { 
         echo "<center><div style='width:500px;border: 3px solid #FFEEFF; padding: 3px; background-color: #FFDDFF;font-family: verdana; font-size: 10px'><b>cURL Error</b><br>$error</div></center>";
         die; 
    } 
 }

XmlRead

 class XmlRead{    
    static function Clean($html){
   $html=preg_replace_callback("~<script(.*?)>(.*?)</script>~si",function($m){
      //print_r($m);
     // $m[2]=preg_replace("/\/\*(.*?)\*\/|[\t\r\n]/s"," ", " ".$m[2]." ");
      $m[2]=preg_replace("~//(.*?)\n~si"," ", " ".$m[2]." ");
      //echo $m[2];
      return "<script ".$m[1].">".$m[2]."</script>";
      }, $html);
  $search = array(
        "/ +/" => " ",
        "/<!--\{(.*?)\}-->|<!--(.*?)-->|[\t\r\n]|<!--|-->|\/\/ <!--|\/\/ -->|<!\[CDATA\[|\/\/ \]\]>|\]\]>|\/\/\]\]>|\/\/<!\[CDATA\[/" => "");
  //$html = preg_replace(array_keys($search), array_values($search), $html);   
  $search = array(
       "/\/\*(.*?)\*\/|[\t\r\n]/s" => "",
       "/ +\{ +|\{ +| +\{/" => "{",
       "/ +\} +|\} +| +\}/" => "}",
       "/ +: +|: +| +:/" => ":",
       "/ +; +|; +| +;/" => ";",
       "/ +, +|, +| +,/" => ","
       );
       $html = preg_replace(array_keys($search), array_values($search), $html);
       preg_match_all('!(<(?:code|pre|script).*>[^<]+</(?:code|pre|script)>)!',$html,$pre);
$html = preg_replace('!<(?:code|pre).*>[^<]+</(?:code|pre)>!', '#pre#', $html);
$html = preg_replace('#<!--[^\[].+-->#', '', $html);
$html = preg_replace('/[\r\n\t]+/', ' ', $html);
$html = preg_replace('/>[\s]+</', '><', $html);
$html = preg_replace('/\s+/', ' ', $html);
if (!empty($pre[0])) {
    foreach ($pre[0] as $tag) {
        $html = preg_replace('!#pre#!', $tag, $html,1);
    }
}
return($html);
}
function loadNprepare($content,$encod='') {
   $content=self::Clean($content);
   //$content=html_entity_decode(html_entity_decode($content));
  // $content=htmlspecialchars_decode($content,ENT_HTML5);
   $this->DataPage='';
   preg_match('~<body(.*?)>(.*?)</body>~si',$content,$M);
   $this->DataPage=$M[2];
   $HTML=$this->DataPage;
   $HTML="<!doctype html><html><head><meta charset=\"utf-8\"><title>Untitled Document</title></head><body>".$HTML."</body></html>";
   $dom= new DOMDocument; 
   $HTML = str_replace("&", "&amp;", $HTML);  // disguise &s going IN to loadXML() 
  // $dom->substituteEntities = true;  // collapse &s going OUT to transformToXML() 
   $dom->recover = TRUE;
   @$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML); 
   // dirty fix
   foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
      $dom->removeChild($item); // remove hack
   $dom->encoding = 'UTF-8'; // insert proper
    return $dom;
}
function GetBYClass($Doc,$ClassName){
    $finder = new DomXPath($Doc);
    return($finder->query("//*[contains(@class, '$ClassName')]"));
}
function extractText($node) {
     if($node==NULL)return false;    
     if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
         return $node->nodeValue;
     } else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
       if ('script' === $node->nodeName) return '';

     $text = '';
     foreach($node->childNodes as $childNode) {
        $text .= $this->extractText($childNode);
     }
     return $text;
     }
}
function DOMRemove(DOMNode $from) {

    $from->parentNode->removeChild($from);    
 }

}

Call the classes, configured for your page:

 $cc = new cURL();
 $XmlRead=new XmlRead();
 $Data=$cc->GET('http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml');
     // fetch the page
 $doc=$XmlRead->loadNprepare($Data['Body']); // load it as HTML
     // remove two parts of the page that do not belong to the XML
 $XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'title')->item(0));
 $XmlRead->DOMRemove($XmlRead->GetBYClass($doc,'aux')->item(0));
     // select the part that holds the XML
 $productspec=$XmlRead->GetBYClass($doc,'rprt');
 foreach($productspec as $data)
 {
    $content=html_entity_decode(html_entity_decode($XmlRead->extractText($data))); // decode the HTML entities
    print_r($content);  
 }

output:

 <EXPERIMENT_PACKAGE><EXPERIMENT alias="SC_EXP_7229_8#56"center_name="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS><TITLE></TITLE><STUDY_REF accession="ERP000913"refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS></STUDY_REF><DESIGN><DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION><SAMPLE_DESCRIPTOR accession="ERS074283"refname="MR223754-sc-2011-11-18T11:31:44Z-1306470"refcenter="SC"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS></SAMPLE_DESCRIPTOR><LIBRARY_DESCRIPTOR><LIBRARY_NAME>4008297</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT><PAIRED NOMINAL_LENGTH="250"></PAIRED></LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR><SPOT_DESCRIPTOR><SPOT_DECODE_SPEC><READ_SPEC><READ_INDEX>0</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Forward</READ_TYPE><BASE_COORD>1</BASE_COORD></READ_SPEC><READ_SPEC><READ_INDEX>1</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Reverse</READ_TYPE><RELATIVE_ORDER follows_read_index="0"></RELATIVE_ORDER></READ_SPEC></SPOT_DECODE_SPEC></SPOT_DESCRIPTOR></DESIGN><PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL></ILLUMINA></PLATFORM><PROCESSING></PROCESSING></EXPERIMENT><SUBMISSION accession="ERA119046"center_name="SC"submission_date="2012-04-17T09:29:50Z"alias="ERP000913-sc-20120417-2"lab_name=""><IDENTIFIERS><PRIMARY_ID>ERA119046</PRIMARY_ID><SUBMITTER_ID 
namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID></IDENTIFIERS></SUBMISSION><STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977"center_name="SC"accession="ERP000913"><IDENTIFIERS><PRIMARY_ID>ERP000913</PRIMARY_ID><SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID></IDENTIFIERS><DESCRIPTOR><STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE><STUDY_TYPE existing_study_type="Whole Genome Sequencing"></STUDY_TYPE><STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT><CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME><STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria),please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION></DESCRIPTOR></STUDY><SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470"center_name="SC"accession="ERS074283"><IDENTIFIERS><PRIMARY_ID>ERS074283</PRIMARY_ID><SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID></IDENTIFIERS><SAMPLE_NAME><COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME><TAXON_ID>119602</TAXON_ID><SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. 
equisimilis</SCIENTIFIC_NAME></SAMPLE_NAME><SAMPLE_LINKS><SAMPLE_LINK><ENTREZ_LINK><DB>biosample</DB><ID>859730</ID></ENTREZ_LINK></SAMPLE_LINK></SAMPLE_LINKS><SAMPLE_ATTRIBUTES><SAMPLE_ATTRIBUTE><TAG>Strain</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>Sample Description</TAG><VALUE></VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-StrainOrLine</TAG><VALUE>MR223754</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Sex</TAG><VALUE>not applicable</VALUE></SAMPLE_ATTRIBUTE><SAMPLE_ATTRIBUTE><TAG>ArrayExpress-Species</TAG><VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE></SAMPLE_ATTRIBUTE></SAMPLE_ATTRIBUTES></SAMPLE><RUN_SET><RUN alias="SC_RUN_7229_8#56"center_name="SC"accession="ERR109334"total_spots="2708543"total_bases="406281450"size="334475592"load_done="true"published="2012-04-27 20:11:35"is_public="true"cluster_name="public"static_data_available="1"><IDENTIFIERS><PRIMARY_ID>ERR109334</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID></IDENTIFIERS><EXPERIMENT_REF refname="SC_EXP_7229_8#56"refcenter="SC"accession="ERX086768"><IDENTIFIERS><PRIMARY_ID>ERX086768</PRIMARY_ID><SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID></IDENTIFIERS></EXPERIMENT_REF><Pool><Member member_name=""accession="ERS074283"sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470"spots="2708543"bases="406281450"></Member></Pool></RUN></RUN_SET></EXPERIMENT_PACKAGE>

Sorry, but I got the error "Using $this when not in object context" on the preg_replace_callback on line 205 (function loadNprepare). — jose bode, Apr 06 '13 at 20:32
That's because your PHP version is old. No problem, I have updated that part of the code. — mohammad mohsenipur, Apr 06 '13 at 20:42
IMHO this class is nothing you should use. It does too much without explaining anything, and the code quality is poor. On top of that, it is not necessary. Plain DOMDocument and DOMXPath work fine, see my answer: http://stackoverflow.com/a/15863656/367456 - it does the job with a fraction of the code, and even *formats* the output properly. — hakre, Apr 07 '13 at 14:50
