
I have created this PHP script, which takes a very long time to filter the CSV file into 19 smaller ones. The CSVs are in this Google Drive folder:

https://drive.google.com/drive/folders/1Bju4kkVgo21xu3IFeyKWNJ1uClTn534J?usp=share_link

I have the process running line by line to save memory, but it still exceeded the maximum execution time allowed for the PHP script. Is there an improved method to achieve this file breakdown?

<?php

@date_default_timezone_set("GMT");

ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$handle = fopen($inputFile, 'r');

if ($handle) {
    $headerRow = null;
    $siteIDs = array();

    while (($data = fgets($handle)) !== false) {
        $row = str_getcsv($data, ";");

        if ($headerRow === null) {
            $headerRow = $row;
            continue; // Skip processing the header row
        }

        $siteID = $row[array_search('SiteID', $headerRow)];
        if (!in_array($siteID, $siteIDs)) {
            $siteIDs[] = $siteID;

            $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
            $f = fopen($newCSVFilename, 'w');

            fputs($f, implode(';', $headerRow) . PHP_EOL);
        }

        $currentCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $f = fopen($currentCSVFilename, 'a');

        fputs($f, implode(';', $row) . PHP_EOL);

        fclose($f);
    }

    fclose($handle);
}

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';


?>

It has taken a fair amount of time just to get the file processing working at all. I was going to switch to fgetcsv(), but my lecturer told me this method is actually slower?
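
A quick sketch like this (using the same semicolon-delimited $inputFile as above; it only parses and writes nothing) would let me time both parsing approaches directly and check my lecturer's claim:

$t = microtime(true);
$h = fopen($inputFile, 'r');
while (fgetcsv($h, 0, ';') !== false) {
    // parse only, discard the row
}
fclose($h);
echo 'fgetcsv: ' . round(microtime(true) - $t, 2) . 's<br>';

$t = microtime(true);
$h = fopen($inputFile, 'r');
while (($line = fgets($h)) !== false) {
    str_getcsv($line, ';');
}
fclose($h);
echo 'str_getcsv(fgets): ' . round(microtime(true) - $t, 2) . 's<br>';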

In reply to Sammitch: would something like this be more in line with what's necessary?


@date_default_timezone_set("GMT");

ini_set('memory_limit', '512M');
ini_set('max_execution_time', '300');
ini_set('auto_detect_line_endings', TRUE);
ini_set('display_errors', 1);

$inputFile = '/Users/declanryan/Desktop/UNI3.0/year3/WEB/workshop/assgienment/air-quality-data-2003-2022.csv';
$outputDirectory = 'output';

// create the output directory if it doesn't exist
if (!file_exists($outputDirectory)) {
    mkdir($outputDirectory);
}

$source = fopen($inputFile, 'r');
if (!$source) {
    exit('Unable to open input file.');
}

$headerRow = fgetcsv($source, 0, ';');
if (!$headerRow) {
    exit('Unable to read header row.');
}

$columnIndexes = array_flip($headerRow);
$siteIDColumn = $columnIndexes['SiteID'];

$handles = [];

while (($row = fgetcsv($source, 0, ';')) !== false) {
    $siteID = $row[$siteIDColumn];
    if (!isset($handles[$siteID])) {
        $newCSVFilename = $outputDirectory . '/data-' . $siteID . '.csv';
        $handles[$siteID] = fopen($newCSVFilename, 'w');
        if (!$handles[$siteID]) {
            exit('Unable to open output file for SiteID: ' . $siteID);
        }
        fputcsv($handles[$siteID], $headerRow, ';');
    }

    fputcsv($handles[$siteID], $row, ';');
}

foreach ($handles as $handle) {
    fclose($handle);
}

fclose($source);

echo "Done.";
echo "<br>";
echo 'Script took ' . round((microtime(true) - $_SERVER["REQUEST_TIME_FLOAT"]), 2) . ' seconds to run.';


grimx
    Your code is definitely going to be slow if you're constantly re-opening file handles inside the loop, eg: by constantly reassigning `$f`. Open each handle _once_, store it, and refer to it later if necessary. – Sammitch May 16 '23 at 21:08
  • Would fclose() work like that? Or do you mean store the handles within an array? – grimx May 16 '23 at 21:14

2 Answers


Touching the filesystem involves some of the slowest I/O that computers generally do, and PHP abstracts a lot of optimizations away from you. When you repeatedly open a file, write a very small amount of data, and then close it, you not only make those optimizations meaningless, you also do about the worst thing you can do: constantly flush tiny writes to disk.

For something like this you should be opening those handles once, which might look roughly like:

$source = fopen('somefile.csv', 'r');
$handles = [];

$header = fgetcsv($source, 0, ';');

while( $row = fgetcsv($source, 0, ';') ) {
  $some_id = $row[X]; // X = index of the SiteID column
  if( ! key_exists($some_id, $handles) ) {
    $handles[$some_id] = fopen("foo/$some_id.csv", 'w');
  }
  fputcsv($handles[$some_id], $row, ';');
}

foreach($handles as $handle) {
  fclose($handle);
}
fclose($source);

Additionally, there is functionally little difference between fgetcsv() and str_getcsv(fgets()), but implode(';', $row) is not an appropriate CSV encoding method, as it will not perform any string quoting/escaping, etc.
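
To see the difference, take a made-up row where one field happens to contain the delimiter: implode() writes it unquoted, so it re-parses as an extra column, while fputcsv() applies the quoting for you:

$row = ['273', 'NO2;PM10', '2022-01-01']; // made-up row; the second field contains ';'

echo implode(';', $row) . PHP_EOL;   // 273;NO2;PM10;2022-01-01   -> re-parses as 4 fields

$out = fopen('php://output', 'w');
fputcsv($out, $row, ';');            // 273;"NO2;PM10";2022-01-01 -> re-parses as 3 fields
fclose($out);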

Sammitch
  • I've added a new section at the bottom with what I think you mean? – grimx May 16 '23 at 21:31
  • will stream_set_chunk_size make it go any faster too? – grimx May 16 '23 at 21:48
  • @grimx frankly, while my own solution might be a few percent faster, *this* solution is the most correct *and portable* (think "changes of CSV structure"). You may upvote my solution, but you should accept *this* one in case someone else chances by. – LSerni May 17 '23 at 17:40
  • I see; it's more the effort you went through in compiling the answer, and in my case it's marked depending on the speed of the PHP script – grimx May 17 '23 at 18:12
  • But for the general use case, I get the point; it's upvoted for others – grimx May 17 '23 at 18:14

Your execution times stem from three sources:

  • closing and reopening the files is very costly.
  • writing small amounts of data (one line) at a time is not efficient.
  • parsing CSV is not needed in this case, because all you want to know is the siteId of each single row, and the file format is known.

So for example:

define('BUFFERSIZE', 1048576);

$buffers = [];
$handles = [];

$start  = microtime(true);
$memory = memory_get_peak_usage(true);

$fp     = fopen("air-quality-data-2003-2022.csv", "r");
fgets($fp, 10240); // skip the header row
while(!feof($fp)) {
    $line = fgets($fp, 10240);
    if (empty($line)) {
        break;
    }
    [ , , , , $siteId ] = explode(';', $line); // the site ID is the 5th ';'-separated field in this file format
    if (isset($handles[$siteId])) {
        if (strlen($buffers[$siteId]) > BUFFERSIZE) {
            fwrite($handles[$siteId], $buffers[$siteId]);
            $buffers[$siteId] = '';
        }
    } else {
        $handles[$siteId] = fopen("air-quality-{$siteId}.csv", "w");
        $buffers[$siteId] = '';
    }
    $buffers[$siteId] .= $line;
}
fclose($fp);

foreach ($handles as $siteId => $fp) {
    fwrite($fp, $buffers[$siteId]);
    fclose($fp);
}

print "Time elapsed: " . (microtime(true) - $start) . " seconds, memory = " . (memory_get_peak_usage(true) - $memory) . " bytes \n";

yields (on my system):

Time elapsed: 0.9726489448547 seconds, memory = 20971520 bytes

I ran some experiments with different BUFFERSIZE values (the reported memory is memory beyond what the script had already allocated).

Buffer = 4096, time elapsed: 1.3162 seconds, memory = 0 bytes
Buffer = 32768, time elapsed: 1.0094 seconds, memory = 0 bytes
Buffer = 131072, time elapsed: 0.9834 seconds, memory = 2097152 bytes
Buffer = 262144, time elapsed: 0.9104 seconds, memory = 4194304 bytes
Buffer = 500000, time elapsed: 0.9812 seconds, memory = 10485760 bytes
Buffer = 400000, time elapsed: 0.9854 seconds, memory = 8388608 bytes
Buffer = 300000, time elapsed: 0.9675 seconds, memory = 6291456 bytes
Buffer = 262144, time elapsed: 1.0102 seconds, memory = 4194304 bytes
Buffer = 262144, time elapsed: 0.9599 seconds, memory = 4194304 bytes

Note the variability (I should probably have rebooted, or at least run sync and flushed the cache, between tests), and the fact that it doesn't take much of a buffer to improve speed (past a certain point, efficiency starts decreasing again as PHP struggles with very large string concatenations). The optimal buffer size will depend on the underlying file system: if it is cache-backed, as mine is, chances are that large BUFFERSIZE values will not change things very much.

LSerni
  • You're a master; that is fast compared to whatever methods I've implemented. I knew somewhat of the reasons why it was slow, but I didn't have exact improvements to speed this process up despite my research. I gather that somewhere in the middle is the best buffer size from your testing. I haven't turned off my laptop in weeks and it still managed a nice 21 seconds. Very nice work – grimx May 16 '23 at 22:32
  • I take two issues with this answer: 1. explode() is not suitable as a CSV parser, as quoted and/or escaped separators are not taken into account and will throw off your column indexes. 2. PHP does not necessarily write data when a call to a file write function is made; PHP will buffer that data itself for later, larger writes. Duplicating buffers into userspace can cause issues of its own. – Sammitch May 17 '23 at 00:20
  • @Sammitch you're correct on both accounts; I did say that *in this case* full CSV parsing is not needed. In the general case `explode()` would be a **bad choice**. – LSerni May 17 '23 at 14:08
  • occ 1,2,3 are empty for all lines within data-481.xml despite it having a output csv file with content so when I run it through xml convert script it's empty? – grimx May 17 '23 at 19:21
  • @grimx sorry, I do not follow. XML? And what is OCC? – LSerni May 18 '23 at 13:35