Format of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE >
<root>
 <node>
  <element1></element1>
  <element2></element2>
  <element3></element3>
  <element4></element4>
 </node>
</root>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE >
<root>
 <node>
  <element1></element1>
  <element2></element2>
  <element3></element3>
  <element4></element4>
 </node>
</root>

Several more XML documents, each with its own declaration, follow in the same file. The file is 500MB. How can I parse this file with PHP without breaking it up into separate files?

Any help would be appreciated. Thank you.

Jan Mark
  • Readers here generally like to see some prior research before asking questions, just so you know. But fwiw, you may wish to use a 'stream reader' such as XMLReader, rather than one that loads the document fully into memory, such as SimpleXML. – halfer May 28 '12 at 08:45
  • I already have the parsing code. The problem is that the script will not parse the next root node. Thanks anyway for the feedback. – Jan Mark May 28 '12 at 11:54
  • Your document is not valid XML, because an XML declaration is allowed only at the start of a document: http://stackoverflow.com/questions/5479533/problem-xml-declaration-allowed-only-at-the-start-of-the-document You can remove the extra declarations using str_replace (http://stackoverflow.com/questions/2159059/string-replace-in-a-large-file-with-php) and then work from a valid XML document. – baptme May 28 '12 at 07:32

1 Answer

If you do not want to split the file, you will have to work with it in memory. Given your 500MB file size, this could turn out to be problematic. Anyway, one option would be to remove the XML prolog and DocType from all documents, wrap the result in a single new root element, and then load the whole thing like this:

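// Strip the per-document XML prolog and DocType, then wrap everything
// in a single <roots> element so the result is one well-formed document.
$dom = new DOMDocument;
$dom->loadXML(
    sprintf(
        '<?xml version="1.0" encoding="UTF-8"?>%s' .
        '<!DOCTYPE >%s' .
        '<roots>%s</roots>',
        PHP_EOL,
        PHP_EOL,
        str_replace(
            array(
                '<?xml version="1.0" encoding="UTF-8"?>',
                '<!DOCTYPE >' // placeholder from the question: use the file's actual DocType string
            ),
            '',
            // Note: this reads the entire file into memory at once.
            file_get_contents('/path/to/your/file.xml')
        )
    )
);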

This would make it one huge XML file with just one XML prolog and one DocType (note I am assuming the DocType is the same for all documents in the file). You could then process the file by iterating over the individual root elements.
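
As a minimal sketch of that last step, the loop below walks the children of the wrapper element and reads one value from each original document. The inline sample string and the `a`/`b` values are made up for illustration; in practice `$dom` would be the document loaded from the merged file, and the DocType is left out here because the question only shows a placeholder.

```php
<?php
// Illustrative stand-in for the merged file contents (values are hypothetical).
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . PHP_EOL .
    '<roots>' . PHP_EOL .
    '<root><node><element1>a</element1></node></root>' . PHP_EOL .
    '<root><node><element1>b</element1></node></root>' . PHP_EOL .
    '</roots>';

$dom = new DOMDocument;
$dom->loadXML($xml);

$values = array();
// Each element child of <roots> is one of the original <root> documents.
foreach ($dom->documentElement->childNodes as $root) {
    if ($root->nodeType !== XML_ELEMENT_NODE) {
        continue; // skip whitespace text nodes between the documents
    }
    $values[] = $root->getElementsByTagName('element1')->item(0)->nodeValue;
}

print_r($values);
```

The node type check matters because the line breaks between the original documents end up as text nodes inside the wrapper element.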

Gordon
  • I am using XMLReader because I am parsing a large XML file, which I read as a stream. Can you help me with equivalent code that will work with XMLReader? Thanks. – Jan Mark May 28 '12 at 11:58
  • Thanks for the idea. I just removed the XML declaration and DocType while streaming through the file and added a main root. It works now. – Jan Mark May 29 '12 at 01:27
  • This works for me with a 100MB file and the code runs in about 5 seconds. Note that you'll have to allocate more memory to PHP using something like: ini_set('memory_limit', '768M'); – markashworth Jul 02 '13 at 02:53
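
The streaming idea from the comments (drop each XML declaration and DocType while reading, and add one main root) can be sketched without loading the whole file. This is not the asker's actual code, which was not posted. It swaps XMLReader for PHP's XML Parser extension (`xml_parse`), because that parser accepts input in chunks. The function name `collectElement1`, the sample values, and the assumption that every declaration and DocType sits on a line of its own are all hypothetical.

```php
<?php
// Hypothetical helper: streams a multi-document file and returns the text
// of every <element1>, holding only one line in memory at a time.
function collectElement1($path)
{
    $values = array();
    $current = null; // text buffer while inside <element1>, otherwise null

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current) {
            if ($name === 'element1') {
                $current = '';
            }
        },
        function ($p, $name) use (&$current, &$values) {
            if ($name === 'element1') {
                $values[] = $current;
                $current = null;
            }
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$current) {
        if ($current !== null) {
            $current .= $data;
        }
    });

    $handle = fopen($path, 'r');
    xml_parse($parser, '<roots>', false); // artificial main root
    while (($line = fgets($handle)) !== false) {
        $trimmed = ltrim($line);
        // Assumption: declarations and DocTypes each occupy a whole line.
        if (strpos($trimmed, '<?xml') === 0 || strpos($trimmed, '<!DOCTYPE') === 0) {
            continue;
        }
        if (xml_parse($parser, $line, false) !== 1) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    fclose($handle);
    xml_parse($parser, '</roots>', true); // close the main root and finish

    return $values;
}

// Usage with a small temporary file shaped like the one in the question.
$tmp = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($tmp, implode(PHP_EOL, array(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE >',
    '<root><node><element1>a</element1></node></root>',
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE >',
    '<root><node><element1>b</element1></node></root>',
)));
$values = collectElement1($tmp);
unlink($tmp);
print_r($values);
```

If a DocType spans several lines or shares a line with other markup, the line-based filter above would need to be replaced with something more careful.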