6

I am building a CSV export tool. It's been working well for months, but we've recently encountered a couple of cases where there are extra lines caused by a cell "exploding".

I've narrowed the problem down to a bug in PHP (7.2.21 and others). I need to work around this. Below is the PHP script to reproduce the problem. It includes commas and linebreaks in each cell:

<?php
$data = array(
    array( 'ID', 'Name', 'Content', 'Date' ),
    array( 34, 'Radley', 'This is <strong>bold text</strong>, and' . "\r\n" . 'a second line, the first time', '2019-08-23' ),
    array( 47, 'John', 'This a <a href=\"http://example.org/\">link</a>' . "\r\n" . 'a second line, again', '2019-08-24' ),
    array( 65, 'Bob', 'This plain text, with no html, and lots of commas'. "\r\n" . 'and a third extra row', '2019-08-25' ),
);

$fh = fopen('php://output', 'w');
foreach( $data as $row ) fputcsv( $fh, $row );
echo stream_get_contents( $fh );
fclose( $fh );
?>

If we print_r the $data we can see this is only four values:

[2] => Array (
    (
        [0] => 47
        [1] => John
        [2] => This a <a href=\"http://example.org/\">link</a>
a second line, again
        [3] => 2019-08-24
    )
)

So I would expect this code to generate four rows with four columns each, but the third item by "John" is only 3 columns and adds an extra row with three columns as well.

badly formatted csv

The problem is not just that \" is already escaped. Having that in the string disables fputcsv from escaping EVERYTHING ELSE.

This malformed data can be seen in Excel for Windows and Mac, and also Google Sheets in Windows Chrome, and probably everything else. But if you load it back in to PHP with fgetcsv it works, so PHP must do this intentionally?


I need to correct this bug so that \" gets escaped properly into \"", and so all other commas and double-quotes get escaped as well. The plugin I am building exports content generated by other plugins, and those might have escaped data that needs to stay escaped with their own mechanisms.

Is there any way to ensure that each cell gets escaped, even if the cell has backslashes and quotes?

Radley Sustaire
  • 3,382
  • 9
  • 37
  • 48
  • 1
    It's seen in Google Sheets in Linux FF and Chrome also, only LibreOffice survived this and show the data properly. – Accountant م Aug 24 '19 at 01:39
  • 1
    Question is (for me), how are those slashes getting in there in the first place? Why keep them and if not, you can use a function to remove them altogether. I can't see why you would want to keep an invalid HTML format for a URL, any special reason? Edit: Ok I saw the *"The plugin I am building exports content generated by other plugins, and those might have escaped data that needs to stay escaped with their own mechanisms."* which answered one of my questions. However, you could get rid of them if you want, before they're ever written to file. – Funk Forty Niner Aug 24 '19 at 01:47
  • @FunkFortyNiner I appreciate the concern, the unnecessary escaping bugs me too! But it's not something to worry about here. This export feature specifically needs to have the same formatting so that you can import the file back in to the system. The field with the escape double quotes is not from my code. – Radley Sustaire Aug 24 '19 at 05:13
  • @RadleySustaire Righto :) Glad to see you've gotten the (right) answer you were looking for. – Funk Forty Niner Aug 24 '19 at 13:48

1 Answers1

8

The PHP manual stated that

If an enclosure character is contained in a field, it will be escaped by doubling it, unless it is immediately preceded by an escape_char.

And this is your case! you used fputcsv with it's default parameters, the " as the enclosure and \ as the escape_char, so the function left \" and didn't turn it to \""

How to fix ?

If you are using PHP >= 7.4.0 then just disable the proprietary escape mechanism. by supplying an empty string in the escape_char parameter

 fputcsv( $fh, $row, ',', '"' , '');

if you are before 7.4.0 then searching has shown me this workaround : pass "\0" as the escape_char parameter. (I tested it and it worked on your example)

 fputcsv( $fh, $row, ',', '"' , "\0");

This is the widely used hack for this problem according to my search, however Christoph M. Becker stated in his proposal "Kill CSV escaping" ( looks like you are not the only one who is angry with fputcsv CSV escaping :D )

While in many cases passing “\0” as $escape parameter will yield the desired results, this won't work if someone is writing/reading binary CSV files, may have issues with some non ASCII compatible encodings, and is generally to be regarded as a hack.

What about the standards ?

RFC 4180 about CSV files has stated

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote

There is nothing called escaping in standard CSV !! just a double quotes as the enclosure and escape it if it appears in the string with another double quotes. May be PHP wanted to support non-standard CSV files, I don't know!

After some searching and testing, it turns out this is a known issue with CSV files PHP functions fgetcsv and fputcsv. here are some other links I found during the search you might find it interesting:

Paul Dixon
  • 295,876
  • 54
  • 310
  • 348
Accountant م
  • 6,975
  • 3
  • 41
  • 61
  • Excellent solution and explanation! Thank you, it worked perfectly! This was driving me mad and I'm glad I'm not just crazy. It did seem like PHP was making up some other rule that no other software was following. Strange! – Radley Sustaire Aug 24 '19 at 05:06
  • Yes, it looks like this a strange behavior from fputcsv – Accountant م Aug 24 '19 at 05:10