I have this XML file:

http://www.metacafe.com/tags/cats/rss.xml

With this code:

$xml = simplexml_load_file('http://www.metacafe.com/tags/cats/rss.xml', 'SimpleXMLElement', LIBXML_NOCDATA);
echo $xml->channel->item->title . "<br>";
echo $xml->channel->item->description . "<br>";

I get this output:

Dad Challenges Kids to Climb Walls to Get Candy<br>
<a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><img src="http://s3.mcstatic.com/thumb/11150410/28824820/4/directors_cut/0/1/dad_challenges_kids_to_climb_walls_to_get_candy.jpg?v=1" align="right" border="0" alt="Dad Challenges Kids to Climb Walls to Get Candy" vspace="4" hspace="4" width="134" height="78" /></a>
                <p>
                Nick Dietz compiles some of the week's best viral videos, 
                including an elephant trying really hard to break a stick, a cat
                sunbathing and kids climbing up the walls to get candy. Plus, 
                making  music with a Ford Fiesta.                              
                <br>Ranked <strong>4.00</strong> / 5 | 78 views | <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/">0 comments</a><br/>
                </p>
                <p>
                 <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><strong>Click here to watch the video</strong></a> (02:38)<br/>
                    Submitted By:                       <a href="http://www.metacafe.com/channels/CBS/">CBS</a><br/>
                    Tags:
                    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;
                    <a href="http://www.metacafe.com/topics/bjbj/">Bjbj</a>&nbsp;
                    <a href="http://www.metacafe.com/topics/ciao/">Ciao</a>&nbsp;                   <br/>
                    Categories: <a href='http://www.metacafe.com/videos/entertainment/'>Entertainment</a>
               </p>

        <br>

I need to get this output instead (which means removing all the other elements):

Dad Challenges Kids to Climb Walls to Get Candy
Nick Dietz compiles some of the week's best viral videos, 
including an elephant trying really hard to break a stick, a cat
sunbathing and kids climbing up the walls to get candy. Plus, 
making  music with a Ford Fiesta.

I don't know how to proceed to get this result.

Vincenzo Lo Palo
  • it's html... you're already using DOM operations to get at the xml node. it's a simple extension of that to tear apart the html in that node and suck out only the bits you want. – Marc B Nov 25 '13 at 18:18
  • can you show me an example please? – Vincenzo Lo Palo Nov 25 '13 at 18:21
  • Note that it's entirely unnecessary to pass `LIBXML_NOCDATA` into SimpleXML; as soon as you ask for the string content of an element, all CDATA and text nodes will be flattened in appropriately. If you're doing something other than `echo`, the syntax to force a variable to be a string is `(string)$var`, e.g. `$html = (string)$xml->channel->item->description`. – IMSoP Nov 25 '13 at 22:51
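
A minimal sketch of the point made in the last comment, using a small inline feed instead of the real one (the sample XML and the variable names are assumptions for illustration). It shows that casting the element to a string yields the CDATA content even when `LIBXML_NOCDATA` is not passed:

```php
<?php
// Inline sample feed (an assumption for illustration; not the real Metacafe feed).
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>Sample title</title>
      <description><![CDATA[<p>Hello <b>world</b></p>]]></description>
    </item>
  </channel>
</rss>
XML;

// No LIBXML_NOCDATA flag is passed here.
$xml = simplexml_load_string($rss);

// The cast flattens the element's text and CDATA nodes into a plain PHP string.
$html = (string) $xml->channel->item->description;

echo $html, PHP_EOL;
```

The resulting `$html` is an ordinary string of markup, so it can be handed to `strip_tags()` or to an HTML parser.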

1 Answer

The reason you're getting the elements inside the description is the CDATA section. To the XML parser, the content of a CDATA section is always text, so elements like <p> are not read into the DOM structure.

A simple strip_tags() will remove all the tags and keep the text between them. For more control, you need to load the HTML fragment into a DOM:

$html = <<<'HTML'
<a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><img src="http://s3.mcstatic.com/thumb/11150410/28824820/4/directors_cut/0/1/dad_challenges_kids_to_climb_walls_to_get_candy.jpg?v=1" align="right" border="0" alt="Dad Challenges Kids to Climb Walls to Get Candy" vspace="4" hspace="4" width="134" height="78" /></a>
                <p>
                Nick Dietz compiles some of the week's best viral videos, 
                including an elephant trying really hard to break a stick, a cat
                sunbathing and kids climbing up the walls to get candy. Plus, 
                making  music with a Ford Fiesta.                              
                <br>Ranked <strong>4.00</strong> / 5 | 78 views | <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/">0 comments</a><br/>
                </p>
                <p>
                 <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><strong>Click here to watch the video</strong></a> (02:38)<br/>
                    Submitted By:                       <a href="http://www.metacafe.com/channels/CBS/">CBS</a><br/>
                    Tags:
                    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;                 <br/>
                    Categories: <a href='http://www.metacafe.com/videos/entertainment/'>Entertainment</a>
               </p>

        <br>
HTML;

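// Parse the HTML fragment; loadHTML() wraps it in <html><body> automatically.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() returns the value of the first node in the matched node set.
$content = $xpath->evaluate("string(//p[1]/text())");
var_dump($content);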

The XPath expression

//p[1]/text() selects the text nodes inside the first p. The string() function converts the first of them into a string. If the node does not exist, the expression returns an empty string.
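
The string returned this way still carries the line breaks and indentation of the source markup. As a hedged sketch, the XPath function `normalize-space()` can be used in place of `string()` to collapse that whitespace. The helper name `firstParagraphText` and the shortened sample fragment are assumptions for illustration:

```php
<?php
// Hypothetical helper: returns the leading text of the first <p>,
// with runs of whitespace collapsed to single spaces.
function firstParagraphText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // text()[1] is the first text node of the paragraph, i.e. everything
    // before the first child element (here the <br>).
    return $xpath->evaluate('normalize-space(//p[1]/text()[1])');
}

// Shortened sample fragment (an assumption, modelled on the feed's description).
$html = <<<'HTML'
<a href="http://example.com/watch/"><img src="http://example.com/thumb.jpg" alt="Title" /></a>
<p>
    Nick Dietz compiles some of the best viral videos,
    including a cat sunbathing.
    <br>Ranked <strong>4.00</strong> / 5
</p>
<p><a href="http://example.com/watch/">Click here to watch the video</a></p>
HTML;

$text = firstParagraphText($html);
echo $text, PHP_EOL;
```

Note that `loadHTML()` assumes ISO-8859-1 unless the markup declares another encoding, so a feed containing non-ASCII characters may need an explicit charset declaration.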

ThW
  • Edited and added an example. – ThW Nov 25 '13 at 18:46
  • Just one more question, please: what if I would like to get only the anchor text of the tags? I mean: Penna, Bjbj, Ciao. Thanks for your precious help with my project! – Vincenzo Lo Palo Nov 25 '13 at 22:27
  • $xpath->evaluate("//a"); will return a DOMNodeList of DOMElement nodes. You can iterate over it with foreach() and read each node's nodeValue property. – ThW Nov 25 '13 at 22:36
  • "Escaped elements are transformed (`&gt;` back to `>`)" - unless I'm misunderstanding, this is wrong: CDATA preserves all data exactly as-is, until it reaches the ending `]]>`. – IMSoP Nov 25 '13 at 22:50
  • Escaping `]]>` as `]]&gt;` in a CDATA block will not work. The `&` remains a literal, so you just have, literally, `]]&gt;`. See http://stackoverflow.com/questions/538163/how-do-i-write-the-literal-inside-a-cdata-section-with-it-ending-the-secti – IMSoP Oct 09 '14 at 13:33
  • You're right, the CDATA section needs to be split. DOM does that automatically. – ThW Oct 09 '14 at 14:27
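
For the follow-up question about the tag names, the `//a` suggestion from the comments can be narrowed so that only the tag links match. This is a rough sketch, assuming the tag links are the anchors whose `href` contains `/topics/` (as in the sample output in the question); the helper name `extractTagNames` is hypothetical:

```php
<?php
// Hypothetical helper: collects the anchor text of every tag link in an HTML fragment.
function extractTagNames(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $names = [];

    // Assumption: tag links are the anchors whose href contains "/topics/".
    foreach ($xpath->evaluate('//a[contains(@href, "/topics/")]') as $anchor) {
        $names[] = trim($anchor->nodeValue);
    }

    return $names;
}

// Fragment taken from the second paragraph of the description in the question.
$html = <<<'HTML'
<p>
    Tags:
    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;
    <a href="http://www.metacafe.com/topics/bjbj/">Bjbj</a>&nbsp;
    <a href="http://www.metacafe.com/topics/ciao/">Ciao</a>&nbsp;<br/>
    Categories: <a href='http://www.metacafe.com/videos/entertainment/'>Entertainment</a>
</p>
HTML;

$tags = extractTagNames($html);
print_r($tags);
```

Filtering on the `href` keeps the "Categories" and "Click here" links out of the result; if the feed changes its URL scheme, the predicate would need adjusting.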