
Is there a maximum file size the XMLReader can handle?

I'm trying to process an XML feed about 3GB in size. There are certainly no PHP errors, as the script runs fine and successfully loads into the database after it's been run.

The script also runs fine with smaller test feeds of 1GB and below. However, when processing larger feeds, the script stops reading the XML file after about 1GB and continues running the rest of the script.

Has anybody experienced a similar problem? And if so, how did you work around it?

Thanks in advance.

    Are you *certain* no PHP errors are being generated? What precisely (as far as you can tell) is the determining factor between working and not working? What does "the script" look like, what else is it doing besides iterating over the XML? – salathe Aug 06 '10 at 14:49
  • In pseudo code the script would look something like this: $this->downloadFeed(); try { $this->writeXMLFeedToCSV(); } catch (e) { /* handle exception */ } $this->uploadCSVToDatabaseTable(); If the script failed due to a PHP error, it would not upload to the database; it currently does. The XML is also properly formed, as when the script is broken down as ircmaxell suggested, it works fine. However, the process is tedious and I was hoping to find a solution. Sorry, due to the nature of the information I am not at liberty to share the script. – A boy named Su Aug 06 '10 at 15:10
  • Which a) operating system b) filesystem c) version of php d) build of php do you use for testing? – VolkerK Aug 06 '10 at 15:34

6 Answers


I had the same kind of problem recently and thought I'd share my experience.

It seems the problem lies in the way PHP was compiled: whether it was compiled with support for 64-bit file sizes/offsets, or only 32-bit.

With 32 bits you can only address 4GB of data. You can find a somewhat confusing but good explanation here: http://blog.mayflower.de/archives/131-Handling-large-files-without-PHP.html

I had to split my files with the Perl utility xml_split, which you can find here: http://search.cpan.org/~mirod/XML-Twig/tools/xml_split/xml_split

I used it to split my huge XML file into manageable chunks. The good thing about the tool is that it splits XML files over whole elements. Unfortunately, it's not very fast.

I only needed to do this once and it suited my needs, but I wouldn't recommend it for repetitive use. After splitting, I used XMLReader on the smaller files of about 1GB in size.
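
For what it's worth, here is a rough sketch of how the splitting step could be scripted from PHP (the paths, chunk size, and output naming are assumptions for illustration; check the xml_split documentation for the exact options):

<?php
// Split a huge feed into roughly 1GB chunks before parsing (hypothetical paths).
// Assumes the xml_split utility from the Perl XML::Twig distribution is on the PATH.
$source = '/data/feed.xml';
exec(sprintf('xml_split -s 1Gb %s', escapeshellarg($source)), $out, $status);
if ($status !== 0) {
    die("xml_split failed with status $status\n");
}

// xml_split writes the chunks next to the source file (e.g. feed-01.xml, feed-02.xml, ...).
foreach (glob('/data/feed-*.xml') as $chunk) {
    $reader = new XMLReader();
    $reader->open($chunk);
    while ($reader->read()) {
        // process each chunk exactly as before
    }
    $reader->close();
}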

gazda

Splitting up the file will definitely help. Other things to try...

  1. Adjust the memory_limit setting in php.ini: http://php.net/manual/en/ini.core.php
  2. Rewrite your parser using SAX: http://php.net/manual/en/book.xml.php . This is a stream-oriented parser that doesn't need to build the whole tree in memory, so it's much more memory-efficient, but slightly harder to program.
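
A minimal sketch of option 2 using PHP's expat-based SAX functions (the file path and the 8KB read size are just placeholders):

<?php
// Stream-parse the feed with PHP's SAX API: only a small read buffer
// is held in memory at any time, never the whole document.

function startElement($parser, $name, $attrs) {
    // e.g. start collecting a new record when its opening tag appears
}

function endElement($parser, $name) {
    // e.g. write the finished record to the CSV when its closing tag appears
}

function characterData($parser, $data) {
    // collect the text content of the current element here
}

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');

$fp = fopen('/path/to/feed.xml', 'rb');   // made-up path
while (!feof($fp)) {
    $chunk = fread($fp, 8192);
    if (!xml_parse($parser, $chunk, feof($fp))) {
        die(xml_error_string(xml_get_error_code($parser)) . "\n");
    }
}
fclose($fp);
xml_parser_free($parser);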

Depending on your OS, there might also be a 2GB limit on the chunk of RAM that you can allocate. That's very possible if you're running a 32-bit OS.

Vineel Shah
  • The XMLReader interface is supposed to handle large documents sequentially like a SAX parser, i.e. it doesn't (necessarily) load the entire document into memory. – VolkerK Aug 06 '10 at 15:41
  • Thanks for that. I had already adjusted the memory limit. VolkerK is right as well: XMLReader reads in a similar manner to a SAX parser. I will try it with SAX if all else fails, but would rather not have to rewrite the script. – A boy named Su Aug 06 '10 at 16:29

It should be noted that PHP in general has a maximum file size. PHP does not allow for unsigned integers or long integers, meaning you're capped at 2^31 (or 2^63 on 64-bit systems) for integers. This is important because PHP uses an integer for the file pointer (your position in the file as you read through it), meaning it cannot process a file larger than 2^31 bytes in size.

However, this should be more than 1 gigabyte. I ran into issues at two gigabytes (as expected, since 2^31 is roughly 2 billion).
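
If it helps, a quick way to check which limit applies on a given build (the values in the comments are what typical 32-bit and 64-bit builds report; the path is a placeholder):

<?php
// On a 32-bit PHP build: PHP_INT_SIZE is 4 and PHP_INT_MAX is 2147483647 (2^31 - 1).
// On a 64-bit build:     PHP_INT_SIZE is 8 and PHP_INT_MAX is 9223372036854775807.
echo 'PHP_INT_SIZE: ', PHP_INT_SIZE, "\n";
echo 'PHP_INT_MAX:  ', PHP_INT_MAX, "\n";

// filesize() also returns an integer, so on a 32-bit build it cannot report
// sizes above ~2GB correctly either.
echo 'filesize: ', filesize('/path/to/feed.xml'), "\n";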

Soup d'Campbells

I've run into a similar issue when parsing large documents. What I wound up doing is breaking the feed into smaller chunks using filesystem functions, then parsing those smaller chunks. So if you have a bunch of <record> tags that you are parsing, parse them out with string functions as a stream, and when you get a full record in the buffer, parse that using the XML functions. It sucks, but it works quite well (and is very memory-efficient, since you only have at most one record in memory at any one time).
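
A rough sketch of that approach, assuming the feed is a flat list of <record> elements (the path and element name are placeholders, and error handling is left out):

<?php
// Stream the file and hand each complete <record>...</record> to the XML parser,
// so at most one record is held in memory at a time.
$fp     = fopen('/path/to/feed.xml', 'rb');
$buffer = '';

while (!feof($fp)) {
    $buffer .= fread($fp, 8192);

    // Extract every complete record currently sitting in the buffer.
    while (($start = strpos($buffer, '<record')) !== false
        && ($end   = strpos($buffer, '</record>', $start)) !== false) {
        $record = substr($buffer, $start, $end + strlen('</record>') - $start);
        $buffer = substr($buffer, $end + strlen('</record>'));

        $xml = simplexml_load_string($record);
        // ... handle one record here ...
    }
}
fclose($fp);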

ircmaxell
  • Thanks, yes that's what I ended up doing as well. But as you mentioned, it sucks :o) Do you happen to know for a fact whether or not there is a max file size the XMLReader can read? – A boy named Su Aug 06 '10 at 15:11
  • Thanks again for your suggestion. I discovered the source of the error and a solution that has been working for me so far, and thought you might be able to implement it. It turns out that there was a vertical tab in the feed (^K, or char 11), which isn't an invalid character in general but is invalid for the document type I was using. I ran the feed through a sed find-and-replace before processing the feed and have since been able to parse feeds greater than 2GB. Thanks to everybody else for your suggestions. – A boy named Su Aug 19 '10 at 14:14

Do you get any errors with

libxml_use_internal_errors(true);
libxml_clear_errors();

// your parser stuff here....    
$r = new XMLReader(...);
// ....


foreach( libxml_get_errors() as $err ) {
   printf(". %d %s\n", $err->code, $err->message);
}

when the parser stops prematurely?

VolkerK
  • No, I don't get any. I'm putting together a standalone copy of the script that may shed some more light on the problem, but I'm quite certain it's not a problem with the XML or the PHP script itself. As long as the file is less than 1GB, it runs the way it's supposed to with no problem. Even when it's larger, it runs fine; it just doesn't read all the XML. Thanks for the suggestion though. – A boy named Su Aug 06 '10 at 16:26
  • "but I'm quite certain it's not a problem with the XML or the PHP script itself." - Only to make sure: The libxml_get_errors() thingy was not to imply there's something wrong with the script or the xml document. I thought libxml might complain about a failed file seek or a text node that is larger than the allowed maximum (which by default is 10MB) or something like that. If you ran into the problem without libxml_get_errors() returning an error this idea is dead :( – VolkerK Aug 06 '10 at 16:33
  • :o) I know that's what you implied. I'm not sensitive - i wasn't being defensive. Sorry if I came across as such. – A boy named Su Aug 06 '10 at 16:40

Using Windows XP with NTFS as the filesystem and PHP 5.3.2, there was no problem with this test script:

<?php
define('SOURCEPATH', 'd:/test.xml');

if ( 0 ) { // flip to 1 to (re)generate the test file first
  build();
}
else {
  echo 'filesize: ', number_format(filesize(SOURCEPATH)), "\n";
  timing('read');
}

function timing($fn) {
  $start = new DateTime();
  echo 'start: ', $start->format('Y-m-d H:i:s'), "\n";
  $fn();
  $end = new DateTime();
  echo 'end: ', $start->format('Y-m-d H:i:s'), "\n";
  echo 'diff: ', $end->diff($start)->format('%I:%S'), "\n";
}

function read() {
  $cnt = 0;
  $r = new XMLReader;
  $r->open(SOURCEPATH);
  while( $r->read() ) {
    if ( XMLReader::ELEMENT === $r->nodeType ) {
      // print a progress dot every 500,000 elements
      if ( 0===++$cnt%500000 ) {
        echo '.';
      }
    }
  }
  echo "\n#elements: ", $cnt, "\n";
}

// build a test file with 60 million <item> elements (~1.38GB, matching the output below)
function build() {
  $fp = fopen(SOURCEPATH, 'wb');

  $s = '<catalogue>';
  //for($i = 0; $i < 500000; $i++) {
  for($i = 0; $i < 60000000; $i++) {
    $s .= sprintf('<item>%010d</item>', $i);
    if ( 0===$i%100000 ) {
      fwrite($fp, $s);
      $s = '';
      echo $i/100000, ' ';
    }
  }

  $s .= '</catalogue>';
  fwrite($fp, $s);
  fflush($fp);
  fclose($fp);
}

output:

filesize: 1,380,000,023
start: 2010-08-07 09:43:31
........................................................................................................................
#elements: 60000001
end: 2010-08-07 09:43:31
diff: 07:31

(as you can see, I screwed up the end-time output: it echoes the start time again. But I don't want to run this script for another 7+ minutes ;-))

Does this also work on your system?


As a side note: the corresponding C# test application took only 41 seconds instead of 7.5 minutes, and my slow hard drive might have been the (or at least a) limiting factor in this case.

filesize: 1.380.000.023
start: 2010-08-07 09:55:24
........................................................................................................................

#elements: 60000001

end: 2010-08-07 09:56:05
diff: 00:41

and the source:

using System;
using System.IO;
using System.Xml;

namespace ConsoleApplication1
{
  class SOTest
  {
    delegate void Foo();
    const string sourcepath = @"d:\test.xml";
    static void timing(Foo bar)
    {
      DateTime dtStart = DateTime.Now;
      System.Console.WriteLine("start: " + dtStart.ToString("yyyy-MM-dd HH:mm:ss"));
      bar();
      DateTime dtEnd = DateTime.Now;
      System.Console.WriteLine("end: " + dtEnd.ToString("yyyy-MM-dd HH:mm:ss"));
      TimeSpan s = dtEnd.Subtract(dtStart);
      System.Console.WriteLine("diff: {0:00}:{1:00}", s.Minutes, s.Seconds);
    }

    static void readTest()
    {
      XmlTextReader reader = new XmlTextReader(sourcepath);
      int cnt = 0;
      while (reader.Read())
      {
        if (XmlNodeType.Element == reader.NodeType)
        {
          if (0 == ++cnt % 500000)
          {
            System.Console.Write('.');
          }
        }
      }
      System.Console.WriteLine("\n#elements: " + cnt + "\n");
    }

    static void Main()
    {
      FileInfo f = new FileInfo(sourcepath);
      System.Console.WriteLine("filesize: {0:N0}", f.Length);
      timing(readTest);
      return;
    }
  }
}
VolkerK