2

I need to read a very large Excel sheet and am trying to use the chunk method in PHPOffice/PHPSpreadsheet.

The issue I'm having is that the loader is returning continually larger amounts of rows with each chunk. I'm expecting to get only a chunkSize amount of rows.

Here's the pertinent code

class chunkReadFilter implements IReadFilter
{
    private $_startRow = 0;

    private $_endRow = 0;

    public function setRows($startRow, $chunkSize)
    {
        $this->_startRow    = $startRow;
        $this->_endRow      = $startRow + $chunkSize;
        Log::info("_startRow: " . $this->_startRow . " _endRow: " . $this->_endRow);
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        if (($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }
        return false;
    }
}


.....
$reader = IOFactory::createReaderForFile($filePath);
$chunkSize = 30;
$chunkFilter = new ChunkReadFilter();
$reader->setReadFilter($chunkFilter);
$reader->setReadDataOnly(true);


Log::info( ' ----- rows of data ' . $highestRow);
for ($startRow = 2; $startRow <= $highestRow; $startRow += $chunkSize) {

 Log::info('Loading WorkSheet using configurable filter for headings row 1 and for rows ' . $startRow . ' to ' . ($startRow + $chunkSize - 1));

 Log::info('reading chunk starting at row ' . $startRow . '-- chunk size ' . $chunkSize);

  $chunkFilter->setRows($startRow, $chunkSize);
  $spreadsheet = $reader->load($filePath);
  $spreadsheet->setActiveSheetIndexByName(ucfirst($which) . " Data");
  $records = $spreadsheet->getActiveSheet()->toArray(null, true, true, true);

  Log::info("records count: " . count($records));

  foreach($records as $key => $record) {

    # code...

   }

  }
.....

The log output looks like this. As you can see the records count grows by chunkSize with each iteration. I get chunkSize rows of data and chunkSize * i of blank rows.

Where have I gone wrong? Why am I getting all of these blank rows?

[2020-02-04 15:35:25] local.INFO: Analysis file upload started 
[2020-02-04 15:35:26] local.INFO:  ----- rows of data 101  
[2020-02-04 15:35:26] local.INFO: Loading WorkSheet using configurable filter for headings row 1 and for rows 2 to 31  
[2020-02-04 15:35:26] local.INFO: reading chunk starting at row 2-- chunk size 30  
[2020-02-04 15:35:26] local.INFO: _startRow: 2 _endRow: 32  
[2020-02-04 15:35:26] local.INFO: records count: 30  
[2020-02-04 15:35:28] local.INFO: Loading WorkSheet using configurable filter for headings row 1 and for rows 32 to 61  
[2020-02-04 15:35:28] local.INFO: reading chunk starting at row 32-- chunk size 30  
[2020-02-04 15:35:28] local.INFO: _startRow: 32 _endRow: 62  
[2020-02-04 15:35:28] local.INFO: records count: 60  
[2020-02-04 15:35:32] local.INFO: Loading WorkSheet using configurable filter for headings row 1 and for rows 62 to 91  
[2020-02-04 15:35:32] local.INFO: reading chunk starting at row 62-- chunk size 30  
[2020-02-04 15:35:32] local.INFO: _startRow: 62 _endRow: 92  
[2020-02-04 15:35:32] local.INFO: records count: 90  
[2020-02-04 15:35:38] local.INFO: Loading WorkSheet using configurable filter for headings row 1 and for rows 92 to 121  
[2020-02-04 15:35:38] local.INFO: reading chunk starting at row 92-- chunk size 30  
[2020-02-04 15:35:38] local.INFO: _startRow: 92 _endRow: 122  
[2020-02-04 15:35:38] local.INFO: records count: 100  
[2020-02-04 15:35:45] local.INFO: DONE*************************************  
Dave
  • 41
  • 4
  • 1
    Maybe: (a) read filtering works the way you're not expecting. (I would try to log all calls to `readCell()` somewhere. (you can even try to use hacky `fputs()` to save them to separate file). And the possibility (b) is that `->toArray()` actually returns all the records regardless of whether they were filtered or not. So maybe there are diffrent configuration params to it controlling it's ouput? Also if you say that there are extraneous empty rows, but actual rows are valid - you can filter them (using for loop or `array_filter`). Guts tell me ChunkFilter::readCell() behaviour to blame.. – Dimitry K Feb 04 '20 at 22:15
  • Thanks. See my reply below to Vladimir. I think I'm going to filter the empty rows from the results. – Dave Feb 05 '20 at 13:26

2 Answers2

1

Chunk reading doesn't populate rows, but it leaves them empty.
Try to set $reader->setReadEmptyCells(false);

  • Thanks tried that and it didn't work. Then found this https://github.com/PHPOffice/PhpSpreadsheet/issues/810 Checked my version of PHPSpreadsheet and the updated code is here. Not sure what is going on. May just have to build a filter to remove those extra lines. – Dave Feb 05 '20 at 13:20
0

Even it's a little bit late, I will leave here my findings:

It happens because of toArray() (I don`t know if it's intended or not) If you look at:

vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Worksheet/Worksheet.php:2660

return $this->rangeToArray('A1:' . $maxCol . $maxRow, $nullValue, $calculateFormulas, $formatData, $returnCellRef);

you'll see that method do not take into account the chunk size.

So, you can avoid this with ->getCoordinates()

Or:

    $worksheetInfo =  $readerService->getWorksheetInfo();
    $maxCol = $worksheetInfo['lastColumnLetter'];

    for ($startRow = 2; $startRow <= $highestRow; $startRow += $chunkSize) {

       $chunkFilter->setRows($startRow, $chunkSize);
       $maxRow = $startRow + $chunkSize;
       $spreadsheet = $reader->load($filePath);
       $spreadsheet->setActiveSheetIndexByName(ucfirst($which) . " Data");
       $records = $spreadsheet->getActiveSheet()->rangeToArray('A'.$startRow.':'.$maxCol.$maxRow);
}
Zkip
  • 53
  • 6