$url  = "http://example.com/get-xml.php"; // contains broken XML
$file = file_get_contents($url);
$xml  = simplexml_load_string($file);

Message received when simplexml_load_string is called:

"Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 216: parser error : Specification mandate value for attribute mods in"

Warning: simplexml_load_string() [function.simplexml-load-string]:

In summary, there's an XML tag with a space in it, and it's breaking everything.

I'm using PHP to import XML from a third party, and the bad XML tag breaks the whole import. Is there a better way to read in the malformed XML, looking at each XML tag individually? Or can I at least ignore the broken tags?

I guess ideally I would want a file_get_contents method that shows the XML tag too. Any suggestions for a noob? I'm not able to change the 3rd party XML as I get it from a remote service I don't have any influence on.

johnpecan
  • Please let us know what you have tried so far. For example, what didn't work for you in this potentially duplicate question: [Problem with invalid XML/HTML in PHP DOM](http://stackoverflow.com/q/6393401/367456)? You also don't show the broken XML in your question. Just providing an example URL that obviously does not return the XML is not helpful, because we cannot see where your problem lies. Also, next to DOM there is the Tidy extension, which can parse invalid XML and also fix it. – hakre May 01 '13 at 08:55

1 Answer


PHP 5.1+ can parse XML documents that are not well-formed and add the missing elements, e.g. missing closing tags.

This can be very useful if you have to parse XML documents over which you have no influence.

To use this feature, set the DOMDocument property recover to true before loading the XML document. Loading will then always return something more or less useful:

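<?php
$xml = new DOMDocument();
// Enable libxml's recovery mode so malformed input is repaired where possible.
$xml->recover = true;
// The closing </tag> is missing on purpose.
$xml->loadXML('<root><tag>hello world</root>');
print $xml->saveXML();
?>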

This prints the repaired document. A bunch of warnings are emitted as well, but the result still shows up.

code demo here: phpFiddle
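
If the warnings get in the way, one option is to collect them through libxml's internal error handling and then pass the recovered tree on to SimpleXML, which is what the question uses. This is only a sketch: the sample string stands in for the remote XML, and how well recovery works on the real feed is an assumption that would need checking.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->recover = true; // libxml-specific: try to repair malformed input

// Hypothetical sample input with a missing </tag>; replace with the fetched string.
$dom->loadXML('<root><tag>hello world</root>');

// Show what the parser complained about.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

// Hand the recovered tree to SimpleXML for the rest of the import.
$sxml = simplexml_import_dom($dom);
echo $sxml->getName(), "\n";
```

Recovery makes a best guess, so it is worth comparing the repaired tree against the raw input before trusting the imported values.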

Updated to fetch the XML as it is:

If you can use cURL, this should achieve your goal. Try it and let me know.

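<?php
// Fetch the body of $URL with cURL; returns the response string, or false on failure.
function curl_get_file_contents($URL)
{
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($c, CURLOPT_URL, $URL);
    $contents = curl_exec($c);
    curl_close($c);

    // As before, an empty (or otherwise falsy) response also counts as a failure.
    return $contents ? $contents : false;
}
?>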
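
To look at the fetched string as it is, tags included, it may help to escape it before printing. This sketch assumes the output is viewed in a browser, which would otherwise interpret the markup and hide the tags. The sample string and its tag names are hypothetical stand-ins for the remote response.

```php
<?php
// Hypothetical stand-in for the string returned by the fetch function.
$raw = '<root><bad tag>hello</bad tag></root>';

// Escape the markup so a browser displays the tags literally.
$visible = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo '<pre>', $visible, '</pre>';
```

With the tags visible, the broken one can be located and patched in the string before it is handed to a parser.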
Dinesh
  • That's some interesting information but I don't see how it helps me get rid of or ignore the XML tag with a space in it that's breaking my import... – johnpecan May 01 '13 at 05:25
  • If you don't know where the space is, you can still use this function to get the required data. – Dinesh May 01 '13 at 05:27
  • I know the location of where the error occurs. What I don't know is how to fix it. – johnpecan May 01 '13 at 05:30
  • You will need to do regex parsing to replace your space. Check this out; it may give you an idea of how to approach it: http://stackoverflow.com/questions/5210287/how-replace-all-spaces-inside-html-elements-with-nbsp-using-preg-replace – Dinesh May 01 '13 at 05:36
  • Sorry I'm not being clear. In my example, after I get the string from file_get_contents I have all of the XML, but not the tags. I could easily fix the bad tag in a regex if I had access to it, but the way file_get_contents works, I never even get to see the bad xml tag! – johnpecan May 01 '13 at 05:42
  • Sorry for all the confusion. I have updated the code, and if I understand what you want to achieve, the updated code should work fine. – Dinesh May 01 '13 at 05:59
  • Looks like still the same issue when using your "curl_get_file_contents" function instead of "file_get_contents". Thanks for trying though. – johnpecan May 01 '13 at 06:13