
Like many others, I have been struggling with PHPExcel memory usage when reading a file (to convert it to MySQL).

Of course I have tried the usual suggestions found in various places, and I have been able to improve memory efficiency by at least 40%. This includes using a custom chunked reader class, moving the chunked reader instantiation outside of the read loop, and so on.

My test server has 16 GB of RAM, and I set PHP's memory limit to 2 GB. For files under ~200K rows, PHPExcel works (slowly, but surely). Once over a certain size, the script fails, simply printing "Killed" to the shell. The logs showed that the kernel killed PHP because it had used too much memory. Watching CPU and memory usage with the top command, I can see free memory and free swap plummet while used memory and used swap skyrocket.

After reading a LOT about PHPExcel and looking at some of its source files, I have concluded that a lot of data is stored for each cell that simply isn't needed when processing only text. Using:

$objReader->setReadDataOnly(true);

helps a bit, but doesn't really do that much. However, using a chunked reader, setting the chunk size to something small, and then using unset() to clean up big variables should theoretically work. I know that PHPExcel must re-read the whole file for each chunk, but it shouldn't be keeping it all in memory, right?

Here is the code I am currently using:

<?php

date_default_timezone_set("America/New_York");
set_time_limit(7200);
ini_set('memory_limit', '2048M');

include_once("classes/PHPExcel/PHPExcel/IOFactory.php");

$inputFileName = "/PATH/TO/FILE.xlsx";
$inputFileType = PHPExcel_IOFactory::identify($inputFileName);
$worksheetName = "Sheet1";

// Read filter that restricts loading to a single chunk of rows
// (row 1 is always kept so the header row stays available)
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $_startRow = 0;
    private $_endRow = 0;

    public function __construct($startRow, $chunkSize)
    {
        $this->_startRow = $startRow;
        $this->_endRow   = $startRow + $chunkSize;
    }

    // Point the existing filter instance at the next chunk,
    // avoiding a new object per iteration
    public function setRows($startRow, $chunkSize)
    {
        $this->_startRow = $startRow;
        $this->_endRow   = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        // Accept the header row plus any row inside the current chunk
        return ($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow);
    }
}


$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadDataOnly(true);    // skip styles, comments, etc. - cell values only

$chunkSize = 1000;

echo "Got here 1\n";

// Single filter instance, reused (via setRows) for every chunk
$chunkFilter = new chunkReadFilter(2, $chunkSize);

// 378767 is the total number of rows in the test file
for ($startRow = 2; $startRow <= 378767; $startRow += $chunkSize)
{
    $chunkFilter->setRows($startRow, $chunkSize);
    $objReader->setReadFilter($chunkFilter);
    echo "Got here 2\n";

    $objPHPExcel = $objReader->load($inputFileName);
    echo "Got here 3\n";

    $sheet = $objPHPExcel->getSheetByName($worksheetName);
    echo "Got here 4\n";

    $highestRow = $sheet->getHighestRow();
    $highestColumn = $sheet->getHighestColumn();
    echo "Got here 5\n";

    $sheetData = $sheet->rangeToArray("A".$startRow.":".$highestColumn.$highestRow, NULL, TRUE, FALSE);
    print_r($sheetData);
    echo "\n\n";

    // Free this chunk's workbook before loading the next one
    $objPHPExcel->disconnectWorksheets();
    unset($objPHPExcel);
}

?>

Which outputs:

[USER@BOX Directory]# php PhpExcelBigFileTest.php
Got here 1
Got here 2
Killed

This leads to the question: Is PHPExcel trying to load the whole file into memory regardless of my filter? If it is, why is PHP not stopping it at 2G memory usage, but allowing it to get so bad the kernel has to kill PHP?

user2395126
  • The whole file is always loaded into memory; the chunking determines which cells are loaded into the PHPExcel object from that file – Mark Baker Mar 12 '15 at 21:23
  • Ahh I see! Is there any way to avoid loading the whole file into memory? These big Excel files are killer... – user2395126 Mar 12 '15 at 21:25
  • 'fraid not.... not until the day when I get time to switch from SimpleXML to XMLReader – Mark Baker Mar 12 '15 at 21:27
  • Have you looked at cell caching as well, as a way of reducing the memory footprint? – Mark Baker Mar 12 '15 at 21:27
  • That makes sense. I didn't realize that it used SimpleXML. I haven't found anything about cell caching; how would I do that? – user2395126 Mar 12 '15 at 21:30
  • Also, if you put your comment in answer form I will accept it. Thanks for all you do maintaining PHPExcel! – user2395126 Mar 12 '15 at 21:45
  • Until PHPExcel provides a streaming option, you can take a look at Spout: https://github.com/box/spout. It was implemented as a solution for this use case, so you won't have any problem reading more than 200K rows. And it's fast!! – Adrien Mar 15 '15 at 04:41
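
For reference, a Spout read loop looks roughly like the sketch below. This is based on the Spout 2.x API of the time, where the row iterator yields plain arrays of cell values; check the project's README for current usage.

<?php

require_once 'vendor/autoload.php';

use Box\Spout\Reader\ReaderFactory;
use Box\Spout\Common\Type;

// Spout streams the XLSX row by row instead of building the whole
// workbook in memory, so usage stays flat regardless of file size
$reader = ReaderFactory::create(Type::XLSX);
$reader->open('/PATH/TO/FILE.xlsx');

foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        // $row is an array of cell values - insert into MySQL here
        print_r($row);
    }
}

$reader->close();

?>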

1 Answer


PHPExcel currently uses SimpleXML, rather than the more memory-efficient XMLReader, to read XML-based formats such as OfficeOpenXML (.xlsx), OASIS (.ods) and Gnumeric. This means that every XML file within the zipped archive is loaded into PHP memory in its entirety for parsing and for building the PHPExcel object. While cell chunking reduces the memory used by the PHPExcel object itself, by limiting the cells it holds to those allowed by the read filter, it still requires the entire file to be loaded into memory for SimpleXML to parse it.
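
To illustrate the difference, here is a rough sketch (not PHPExcel code) of how a pull parser can stream one worksheet's XML straight out of the xlsx archive, holding only the current node in memory rather than the whole document tree:

<?php

// Rough sketch: stream xl/worksheets/sheet1.xml out of the zipped
// xlsx archive with XMLReader and count rows, without ever holding
// the full XML document in memory
$reader = new XMLReader();
$reader->open('zip:///PATH/TO/FILE.xlsx#xl/worksheets/sheet1.xml');

$rowCount = 0;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'row') {
        $rowCount++;
    }
}
$reader->close();

echo $rowCount, " rows\n";

?>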

The development team has looked at streaming data directly from the zipped archive into PHP's pull parser, XMLReader, and initial experiments suggest that this is highly memory efficient. However, refactoring the spreadsheet readers to use this approach is a major piece of rewriting, and with limited development resources and limited time available for the work, it's not a task to be undertaken lightly.

In addition to reducing memory by loading only a subset of cells into the PHPExcel object, you may also want to look at cell caching. This is described in the documentation, and it allows cell objects to be stored in ways that reduce the memory they take up. Different methods are provided to suit different systems, and the amount of memory saved will vary with PHP version and configuration, so you need to determine which method is best suited to your own system. There is also a speed cost to cell caching: SQLite is typically the most memory-efficient method, but also one of the slowest.
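
Following the documented pattern, caching must be enabled before the workbook is instantiated or loaded. A minimal example using the php://temp method (cache_to_sqlite3 is typically the most memory efficient, if the sqlite3 extension is available):

<?php

include_once("classes/PHPExcel/PHPExcel/IOFactory.php");

// Must be set BEFORE loading the workbook; cells are held in
// php://temp and spill to disk once the cache exceeds 8MB
$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
$cacheSettings = array('memoryCacheSize' => '8MB');
PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);

$objReader = PHPExcel_IOFactory::createReader('Excel2007');
$objReader->setReadDataOnly(true);
$objPHPExcel = $objReader->load('/PATH/TO/FILE.xlsx');

?>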

Mark Baker