I have several files to parse (with PHP) in order to insert their respective content in different database tables.

First point: the client gave me 6 files. Five are CSV files with comma-separated values; the last one does not come from the same database and its content is tab-separated.

I built a FileParser that uses SplFileObject to execute a method on each line of the file content (basically, create an Entity from each dataset and persist it to the database, with Symfony2 and Doctrine2).

But I cannot manage to parse the tab-separated text file with SplFileObject: it does not split the content into lines as I expect it to...

// In my controller context
$parser = new MyAmazingFileParser();
$parser->parse($filename, $delimitor, function ($data) use ($em) {
    $e = new Entity();
    $e->setSomething($data[0]);
    // [...]
    $em->persist($e);
});

// In my parser
public function parse($filename, $delimitor = ',', $run = null) {
    if (is_callable($run)) {
        $handle = new SplFileObject($filename);
        $infos = new SplFileInfo($filename);

        if ($infos->getExtension() === 'csv') {
            // Everything is going well here
            $handle->setCsvControl(',');
            $handle->setFlags(SplFileObject::DROP_NEW_LINE + SplFileObject::READ_AHEAD + SplFileObject::SKIP_EMPTY + SplFileObject::READ_CSV);
            foreach (new LimitIterator($handle, 1) as $data) {
                $result = $run($data);
            }
        } else {
            // Why does the Iterator way not work?
            $handle->setCsvControl("\t");
            // I have tried all the possible flag combinations, without success...
            foreach (new LimitIterator($handle, 1) as $data) {
                // It only ever gets the first line...
                $result = $run($data);
            }
            // And the old, dirty, memory-killing way works?
            $fd = fopen($filename, 'r');
            $contents = fread($fd, filesize($filename));
            foreach (explode("\t", $contents) as $line) {
                // Gets all the lines as I want... but it's dirty and memory-expensive!
                $result = $run($line);
            }
        }
    }
}

It is probably related to the horrible formatting of my client's file, but after a long discussion with them, they really cannot provide another format, for some acceptable reasons (constraints on their side), unfortunately.

The file is currently 49459 lines long, so I really think memory matters at this step. I have to make the SplFileObject approach work, but I do not know how.

An extract of the file can be found here: Data-extract-hosted

Flo Schild
  • Why don't you use `fgetcsv`? – hindmost Apr 11 '14 at 18:36
  • Because it is not a CSV file, unfortunately – Flo Schild Apr 11 '14 at 18:40
  • `fgetcsv` (http://www.php.net/manual/en/function.fgetcsv.php) can parse more than standard `CSV`. Check out its parameters – hindmost Apr 11 '14 at 18:45
  • Okay, I didn't know, but it does not work with SplFileObject or Iterators, I guess, does it? – Flo Schild Apr 11 '14 at 18:52
  • `SplFileObject` has its own counterpart of `fgetcsv`: http://www.php.net/manual/en/splfileobject.fgetcsv.php – hindmost Apr 11 '14 at 19:04
  • I tried it but it only gets the first line too – Flo Schild Apr 11 '14 at 19:08
  • It may help to provide part or all of the file somewhere that we can download and then test it ourselves. Obviously remove all the sensitive info. Just a few lines will be sufficient. – Ryan Vincent Apr 11 '14 at 21:23
  • I've updated the question with a link to the extract! Thanks in advance! – Flo Schild Apr 11 '14 at 22:42
  • I have had a look at the file. It is a really regular format: fields are delimited by the tab character (0x09, decimal 9), and the end-of-record is a carriage-return character (0x0D, decimal 13). I changed the file extension to 'csv' and asked OpenOffice to load it in Calc, with all the options switched off except tab. I then used a hex editor to look at the data. Thanks for the data file extract; it made it easy to see the issues. – Ryan Vincent Apr 12 '14 at 18:54
  • Okay, I suspected such an issue after some other failed attempts... I will let my client know. Thanks a lot for trying and helping! – Flo Schild Apr 13 '14 at 17:45
  • Should have made it clear: OpenOffice loads it fine with the settings I gave. I suspect that any CSV file loader with those settings would load it fine. – Ryan Vincent Apr 14 '14 at 16:05
  • How do you parse a CSV with a "decimal-defined" character? – Flo Schild Apr 16 '14 at 13:21
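
Following Ryan Vincent's observation in the comments that each record ends with a lone carriage return (0x0D) and fields are separated by tabs (0x09), one possible way to read the file record by record, without loading it all into memory, is PHP's `stream_get_line()`, which accepts a custom record ending. This is only a minimal sketch, assuming the fields are not quoted and no record exceeds the chosen maximum length; `readCrDelimitedRecords` and `$maxRecordLength` are hypothetical names, not part of the original parser.

```php
<?php
// Hypothetical helper (not from the original post): lazily yields one array
// of fields per record from a file whose records end with "\r".
// Assumption: fields are plain text without CSV-style quoting.
function readCrDelimitedRecords(string $filename, string $delimiter = "\t", int $maxRecordLength = 65536): Generator
{
    $fd = fopen($filename, 'r');
    if ($fd === false) {
        throw new RuntimeException("Cannot open file: $filename");
    }

    try {
        // stream_get_line() reads up to the given ending (or the length
        // limit, or end of file) and returns the data without the ending.
        while (($record = stream_get_line($fd, $maxRecordLength, "\r")) !== false) {
            if ($record === '') {
                continue; // skip empty records
            }
            yield explode($delimiter, $record);
        }
    } finally {
        fclose($fd);
    }
}
```

The generator could then be consumed with a plain `foreach`, skipping the header record before passing each `$data` array to the `$run` callback, so only one record is held in memory at a time.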

0 Answers