
I'm using PHPExcel, which I've used many times before. The problem I have now is when reading Excel2007 files (.xlsx format). What I'm doing is simply looping through the .xlsx file, building an array by row/column, and then print_r()-ing the results after the read operation to make sure the data output is good before importing it into a MySQL database.

Now when reading the Excel2007 .xlsx file (6MB) the output fails, but what's interesting is that if I save the file in the older .xls format (Excel5, 1992-2004) the file becomes larger (16MB) but outputs correctly. This originally made me think it wasn't a memory problem, since the older, larger .xls file (16MB) ran with no problems and was almost 3x the size of the .xlsx file (6MB).

For test purposes I then copied 25 of the 30,000 rows in the .xlsx (6MB) file into a new Excel2007 .xlsx and ran the import against the smaller 25-row data set, and it output correctly. This then led me to think that it is a memory problem after all, but one related specifically to the .xlsx format...

I'm running the server on Amazon Web Services on a c4.4xlarge instance (16 vCPUs, 30GB RAM), so I should have plenty of resources for this operation.


Question: Why does reading the smaller (6MB) .xlsx file fail, while the larger (16MB) .xls file and the 25-row .xlsx file both output correctly?


//PHP Function

function parse_xls($file){
    ini_set('memory_limit', '-1');
    $type = PHPExcel_IOFactory::identify($file);
    $reader = PHPExcel_IOFactory::createReader($type);
    $reader->setReadDataOnly(true);
    $xls = $reader->load($file);
    $sheet = $xls->getActiveSheet();
    $highestRow = $sheet->getHighestRow();
    $highestColumn = $sheet->getHighestColumn();
    $highestColumnIndex = PHPExcel_Cell::columnIndexFromString($highestColumn);
    $import = [];
    // start at row 2 to skip the header row
    for($row = 2; $row <= $highestRow; $row++){
        $import[$row] = [];
        for($col = 0; $col < $highestColumnIndex; $col++){
            $import[$row][] = $sheet->getCellByColumnAndRow($col, $row)->getValue();
        }
    }
    print_r($import);
    die();
}
Jordan Davis
  • I haven't a clue! There's any number of possible reasons.... Physical filesize doesn't reflect the number of worksheets/rows/columns. Content of the cells, number of different styles in the file, etc can all affect memory requirements.... and it's quite possible that saving as xls eliminates unused styles or references that exist in the xlsx – Mark Baker Feb 09 '16 at 17:34
  • Are you using cell caching at all? If you're building an array of rows/columns then you're significantly increasing the memory demands of your script – Mark Baker Feb 09 '16 at 17:35
  • @MarkBaker I just posted the function I wrote so you can take a look, maybe you can find something I missed. – Jordan Davis Feb 09 '16 at 17:38
  • 2
    Well `$import` is going to demand a large amount of memory.... PHP arrays are very resource intensive.... and even then you could use PHPExcel Worksheet's `toArray()` method instead rather than write your own loop, but that would still be as memory intensive – Mark Baker Feb 09 '16 at 17:42
  • But using Cell Caching can help reduce the memory footprint, though at a cost in execution speed – Mark Baker Feb 09 '16 at 17:43
  • @MarkBaker I'll def take a look at `toArray()`, wasn't aware of that! Still odd though that the larger `.xls` file outputs fine, you know. – Jordan Davis Feb 09 '16 at 17:44
  • @MarkBaker Do you have a reference link to a good article on cell caching? Or could you post an answer modifying my function to use `toArray()` and cell caching so I can throw it in and see if it works? – Jordan Davis Feb 09 '16 at 17:46
  • File size doesn't relate to memory usage, and OfficeOpenXML and BIFF are very different formats that store data in very different ways, so they can have very different sizes – Mark Baker Feb 09 '16 at 17:47
  • [PHPExcel Configuration Settings](https://github.com/PHPOffice/PHPExcel/blob/develop/Documentation/markdown/Overview/04-Configuration-Settings.md) – Mark Baker Feb 09 '16 at 17:47
  • @MarkBaker got it, yea I figured it was just a more condensed format if it's being reduced to almost a third of the size, but PHP should still be able to do it, especially if the server's got 30GB of RAM. – Jordan Davis Feb 09 '16 at 17:48
  • Not condensed at all, completely different.... OfficeOpenXML (xlsx) is a collection of XML files, compressed into a zip archive; BIFF (xls) is a binary format within an OLE container.... totally different animals – Mark Baker Feb 09 '16 at 17:50
  • @MarkBaker wow ok did not know that but makes sense since `.xlsx` requires `php_zip` module – Jordan Davis Feb 09 '16 at 17:53
  • If you're running 32-bit PHP, there is an absolute memory limit of 2GB that PHP won't exceed – Mark Baker Feb 09 '16 at 17:54
  • @MarkBaker yea I do know the memory limit is 2GB, but I don't think it would be anywhere near that size, you know. – Jordan Davis Feb 09 '16 at 18:10
  • Perhaps between decompressing, reading in the whole Excel file, and then trying to store it all in a 2D array, the program ends up using the full 2GB of RAM. It seems unlikely, but I cannot think of anything else. Do you have access to the error logs so you can see exactly which errors were raised? – Spencer D Feb 09 '16 at 18:15
  • @SpencerDoak yea I think you're right on the money, it's got to be related to the compression and using the 2GB, because the smaller `.xlsx` file decompresses and outputs just fine. Yea I have an active pipe `tail -f /var/log/httpd/error_log` watching the log, but get no errors, smh... – Jordan Davis Feb 09 '16 at 18:18
  • @SpencerDoak I'm implementing Mark's `toArray()` and caching to see if that works at all, unless you have any other suggestions? – Jordan Davis Feb 09 '16 at 18:19
  • @MarkBaker wow `toArray()` is the one lol, so much easier, and it's outputting perfectly now thank you thank you, post the answer if you would like and I'll mark it correct. – Jordan Davis Feb 09 '16 at 19:18
  • 2
    Glad you got it working – Mark Baker Feb 09 '16 at 19:20
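Putting the comments' suggestions together, a minimal sketch of what the reworked read could look like — enabling PHPExcel's cell caching (one of several available methods; it trades execution speed for memory) and letting `toArray()` build the array instead of a hand-rolled loop:

```php
// Cell caching must be configured BEFORE the workbook is loaded.
// cache_in_memory_gzip keeps cells in memory, gzip-compressed.
$cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_in_memory_gzip;
PHPExcel_Settings::setCacheStorageMethod($cacheMethod);

$type   = PHPExcel_IOFactory::identify($file);
$reader = PHPExcel_IOFactory::createReader($type);
$reader->setReadDataOnly(true);
$xls    = $reader->load($file);

// toArray($nullValue, $calculateFormulas, $formatData, $returnCellRef)
$import = $xls->getActiveSheet()->toArray(null, true, true, false);
```

This still holds the whole sheet in one PHP array, so for very large files a chunked read (as in the answer below) keeps the peak footprint lower.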

1 Answer


For big files I use a chunk read filter:

$iChunkSize = 1000;
for ($iStartRow = $row_start; $iStartRow <= $totalRows; $iStartRow += $iChunkSize) {
    $objReader = $oExcel->SetCreateReader();
    $oChunkFilter = new chunkReadFilter();
    $oChunkFilter->setRows($iStartRow, $iChunkSize);
    $objReader->setReadFilter($oChunkFilter);
    $objReader->setReadDataOnly(true);
    $objPHPExcel = $objReader->load($files['path']);
    $objPHPExcel->setActiveSheetIndex($iList);
    $sFromCell = 'A'.$iStartRow;
    $aData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, false, $sFromCell);
    // free memory
    unset($objPHPExcel);
    unset($objReader);
    unset($oChunkFilter);
    // parse data
    foreach ($aData as $sKey => $aValue) {
        ...
    }
    // real data rows: stop after the last (partial) chunk
    if (count($aData) < $iChunkSize) {
        unset($aData);
        break;
    }
    unset($aData);
}
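The snippet above assumes a `chunkReadFilter` class is already defined. A sketch of the usual implementation, following the read-filter pattern from the PHPExcel documentation (field names here are illustrative): it implements `PHPExcel_Reader_IReadFilter` so the reader only loads cells inside the current chunk.

```php
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $startRow = 0;
    private $endRow   = 0;

    // Set the row window for the chunk about to be read.
    public function setRows($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow   = $startRow + $chunkSize;
    }

    // Called by the reader for every cell; only cells whose row falls
    // in [startRow, endRow) are loaded, everything else is skipped.
    public function readCell($column, $row, $worksheetName = '')
    {
        return ($row >= $this->startRow && $row < $this->endRow);
    }
}
```

Because the filter is applied before `load()`, each pass only materialises `$iChunkSize` rows, which is what keeps the memory footprint bounded.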

wowik