
I have a large string containing the contents of a CSV file. Up to now, I didn't care about parsing it, as my program was just streaming it from one source to another.

Your mission, should you decide to accept it, is to tell me the best way of removing line breaks from the data elements of a string containing multiple CSV data rows, without throwing away the line breaks separating the rows themselves. The data is properly quoted, and the implementation must run on PHP 5.2...

id,data,other
1,"This is data
with a line break I want replacing",1
2,"This is a line with no line break in the data",0
3,No quotes,42
4,"Quoted field with ""quotes inside"" which is tricky",84
hakre
vogomatix
  • Can you elaborate on `removing line breaks without throwing away the line breaks`? Also, an example of the data and the expected result would improve your question, in my opinion. – Andrius Naruševičius Apr 01 '14 at 11:03
  • Does every line contain a fixed number of fields, I mean you need some info to indicate a single line. E.g. every 5 commas we have a single line. – Melsi Apr 01 '14 at 11:08
  • All CSV data contains a fixed number of fields. :-). @AndriusNaruševičius example added – vogomatix Apr 01 '14 at 11:09
  • Still unclear what _“remove, but not throw away”_ is supposed to mean. Anyway, _your_ mission is RTFM, http://www.php.net/manual/en/function.fgetcsv.php – this will get you the data in the correct way. What you do with it afterwards, is up to you. – CBroe Apr 01 '14 at 11:12
  • The data comes from a webserver, not from a file, hence fgetcsv is not applicable – vogomatix Apr 01 '14 at 11:15
  • Possible duplicate of http://stackoverflow.com/questions/5470991/importing-csv-that-has-line-breaks-within-the-actual-fields – faintsignal Apr 01 '14 at 11:17
  • Not a duplicate, but the linked question does show interesting leads - thank you. – vogomatix Apr 01 '14 at 11:20
  • Yeah, you're right, the linked question's OP was using a later version of PHP, my bad. I'm going to check later at work if the library we use handles this case properly as well. – faintsignal Apr 01 '14 at 11:32
  • Off topic, but if you're still on PHP 5.2, then you are *way* behind. Seriously, I know some companies find it difficult to upgrade, but PHP 5.2 has been unsupported for years now, and has a number of **serious** security bugs which have been fixed in newer versions but will never be patched in 5.2. If you're still running 5.2 on a production server, then you are putting your systems at risk. I strongly recommend upgrading as soon as possible. – Spudley Apr 01 '14 at 12:16
  • @Spudley Yep, unfortunately I have no influence on keeping up to date... :-( – vogomatix Apr 01 '14 at 13:31
  • @vogomatix: I accept that, and I know the feeling. But it's worth re-iterating because it is so important. For example, [this bug](http://blog.imperva.com/2014/03/threat-advisory-php-cgi-at-your-command.html): It allows anyone to run arbitrary code on your system. It was found and fixed in 5.3/5.4/5.5 but the fix was not made to 5.2. And that's just the most recent one; there's been three years for these issues to build up in 5.2. Any sys-admin who is prepared to accept that kind of vulnerability to persist on his network is being grossly negligent. – Spudley Apr 01 '14 at 13:40

1 Answer


I think that if there is a line break inside a CSV data field, there must be an odd (unpaired) number of quotation marks on that line. Escaped quotes inside a field are doubled (""), so they always come in pairs and do not change this parity. If there is such a line, remove its line break and check whether the newly created line is valid. The following pseudo-PHP code should work. Things like Reader and containsOddNumberOfQuotes() are easy to implement in PHP 5.2:

function fixCsv($fileOrString) {
    $reader = new Reader($fileOrString);
    $correctCsv = "";
    while ($reader->hasMoreLines()) {
        $correctCsv = $correctCsv . fixLine($reader, $reader->readLine()) . "\n";
    }
    return $correctCsv;
}

/** Recursive function that returns a valid CSV line. */
function fixLine($reader, $line) {
    if (containsOddNumberOfQuotes($line)) {
        if ($reader->hasMoreLines()) {
            // Try to make a valid CSV line by joining this line with the next one.
            return fixLine($reader, $line . $reader->readLine());
        }
        throw new Exception("Last line is incomplete.");
    }
    else {
        return $line;
    }
}
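
The same idea can also be sketched without the Reader class, working directly on the string. This is only a minimal sketch under some assumptions: the names fixCsvString() and $replacement are made up for illustration, the input is assumed to be properly quoted with " as the only quote character, and each embedded line break is replaced with a single space rather than dropped. It avoids closures, namespaces and short array syntax, so it should stay within what PHP 5.2 offers.

```php
<?php
// Hypothetical helper: true when $line contains an odd number of double
// quotes, i.e. a quoted field is still open at the end of the line.
function containsOddNumberOfQuotes($line) {
    return substr_count($line, '"') % 2 === 1;
}

// Hypothetical function: joins the physical lines that belong to one CSV
// record, putting $replacement where the embedded line break used to be.
function fixCsvString($csv, $replacement = ' ') {
    // Normalise line endings, then split into physical lines.
    $lines = explode("\n", str_replace(array("\r\n", "\r"), "\n", $csv));
    $fixed = array();
    $pending = null; // record under construction, or null when none is open

    foreach ($lines as $line) {
        $pending = ($pending === null) ? $line : $pending . $replacement . $line;
        if (!containsOddNumberOfQuotes($pending)) {
            // Quotes are balanced again, so the record is complete.
            $fixed[] = $pending;
            $pending = null;
        }
    }
    if ($pending !== null) {
        throw new Exception('Last line is incomplete.');
    }
    // Row separators are written back as "\n".
    return implode("\n", $fixed);
}
```

With the first two rows of the example from the question, fixCsvString("id,data,other\n1,\"This is data\nwith a line break I want replacing\",1") returns the header row followed by 1,"This is data with a line break I want replacing",1 on a single line. The loop replaces the recursion of the pseudo-code, which keeps the call depth flat for fields that span many lines.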
Ján Halaša