-1

Some CSV files that we import to our server cannot be parsed correctly.

We are reading the CSV file with PHP's fgetcsv():

while (($line = fgetcsv($file)) !== false) { ... }

However, when the CSV line is wrapped in quotes (and contains two double quotes inside), for example:

"first entry,"""","""",Data Chunk,2022-05-30"

The fgetcsv() function cannot handle the line correctly and sees the first entry,"""","""",Data Chunk,2022-05-30 as one entry.

How can we make sure the function does regard first entry as a separate entry, and also interpretes the other parts """" as empty entries?


On more research I found:

Fields containing double quotes ("), Line Break (CRLF) and Comma must be enclosed with double quotes.

If Fields enclosed by double quotes (") contain double quotes character then the double quotes inside the field must be preceded with another double quote as an escape sequence. Source

This is likely the issue that we face here.


A more complete data example of the CSV:

Allgemeines
Subject,Body,Attachment,Author,Created At,Updated At
"Hello everyone, this is a sample. Kind regards,"""","""",Author name (X),2022-05-30 14:54:32 UTC,2022-05-30 14:54:37 UTC"
","""",https://padlet-uploads.storage.googleapis.com/456456456/testfile.docx,Author name (X),2022-05-15 13:53:04 UTC,2022-05-15 13:54:40 UTC"
",""Hello everyone!"

This is some fun text.
More to come.
Another sentence.
And more text.

Even more text

See you soon.




","",Author name (X),2021-07-22 09:41:06 UTC,2021-07-23 16:12:42 UTC
""
Important Things to Know in 2022
Subject,Body,Attachment,Author,Created At,Updated At
"","

01.01.2022 First day of new year
02.02.2202 Second day of new year

Please plan ahead.
","",Author name (X),2021-07-22 09:58:19 UTC,2022-03-24 14:16:50 UTC
""

Note: Line starts with double quote and ends with double quote and carriage return and new line feed.

Avatar
  • 14,622
  • 9
  • 119
  • 198
  • Perhaps you should escape the quotes as the documentation for `escape` parameter seems to hint as per my personal interpretation `Note: Usually an enclosure character is escaped inside a field by doubling it; however, the escape character can be used as an alternative. So for the default parameter values "" and \" have the same meaning. Other than allowing to escape the enclosure character the escape character has no special meaning; it isn't even meant to escape itself.` – Bijay Regmi May 30 '22 at 20:32
  • The CSVs are external files, we do not export them ourselves. – Avatar May 30 '22 at 20:32
  • 4
    Well if your data does not conform to the rules of a CSV, you cannot use `fgetcsv()` to read it. You might consider pre-parsing the data to reformat all the bits that break the rules – RiggsFolly May 30 '22 at 20:33
  • That's what I thought. Preparsing the file myself. I updated the CSV example above. Maybe someone else has an idea for a quicker fix or can see what the problem is. – Avatar May 30 '22 at 20:34
  • 1
    Wouldnt flat out replacing `""""` with something more parsable be easier? – Bijay Regmi May 30 '22 at 20:39
  • I compared the CSV with another "clean" CSV. It is strange that the entire line is wrapped in double quotes. It is the first time I am workign with CSV and fgetcsv(), but for me the data looks *corrupted*. – Avatar May 30 '22 at 20:42
  • 2
    It was valid CSV until your updated your question to wrap it in another layer of double quotes without escaping the quotes inside. You should ask your data provider to perhaps _not_ perpetrate this nightmare upon you. – Sammitch May 30 '22 at 20:42
  • Thanks. The CSV provider is a multi-million company. I think there is little chance. But I will try. – Removing each single quote and keeping the `""""` as `""` would make it valid. I will write a parser... – Avatar May 30 '22 at 20:43
  • Show us a REAL example of the file, more than one line. Its ok to change any secret info but DONT alter the format – RiggsFolly May 30 '22 at 20:46
  • The line above at "example" is the CSV line with the exact formatting. – I will write a custom parser to get rid of the problem, and post the solution later here. – Avatar May 30 '22 at 20:47
  • 1
    Well half your issue is that not a CSV :) Looks like you might have to read a line then do a bit of text manipulation and an explode on `,` to get your data split into usable individually referencable fields – RiggsFolly May 30 '22 at 20:51
  • Although I cannot even create a PHP variable of that string like this to test some text manipulation `$not_a_csv_line = "first entry,"""","""",Data Chunk,2022-05-30";` Are you completely sure thats the format of the file, because it looks unusable in any language I can think of – RiggsFolly May 30 '22 at 20:53
  • Yes, it is like this, directly copied from the CSV textfile. Here another example of one of those lines: `"Dear xyz,"""","""",My name (extra),2022-05-23 13:54:32 UTC"` – Avatar May 30 '22 at 20:56
  • 2
    Well because you cannot have a double quote inside a double quoted string you are going to have to ask the `multi-million company` to explain how you are supposed to process this data, or maybe provide you an option to download it as JSON for example. _There is no rule that says even the rich guys cant screw it up from time to time_ – RiggsFolly May 30 '22 at 21:09
  • 1
    At least I found some information about the formatting [here](https://phppot.com/php/how-to-handle-csv-with-php-read-write-import-export-with-database/) now: "Fields containing double quotes (“), Line Break (CRLF) and Comma must be enclosed with double quotes." AND "If Fields enclosed by double quotes (“) contain double quotes character then the double quotes inside the field must be preceded with another double quote as an escape sequence." – This is likely the issue that I face. – Avatar May 31 '22 at 05:48
  • 1
    Yes, this is "valid" CSV, *if* you expect to have a single column (with a value enclosed in quotes, which contains comma-separated values). Perhaps it has been double-CSV-encoded, and you need to CSV-parse that one column value again using a CSV parser. – deceze May 31 '22 at 06:36
  • I have added a more complete data example of the CSV above. It would be helpful if someone has an idea to parse the data. – Avatar May 31 '22 at 06:50

1 Answers1

1

Turns out the CSV data was corrupted.

The user messed around with the CSV in Excel, and as stated in the comments, likely overwrote the original CSV. Causing double escapings.

For anyone facing the same issue:

  1. Do not waste your time in trying to recover corrupted CSV files with a custom parser.

  2. Ask your user to give you access to the original CSV export site and generate the CSV yourself.

  3. Check the CSV integrity. See code below.

    $file = fopen($csvfile, 'r');

    // validate if all the records have same number of fields, empty lines (count 1), full entry (count 6) - depends on your CSV structure

    $length_array = array();

    while (($data = fgetcsv($file, 1000, ",")) !== false) 
    {
        // count number of entries
        $length_array[] = count($data);
    };

    $length_array = array_unique($length_array);

    // free memory by closing file
    fclose($file);
    
    // depending on your CSV structure it is $length_array==1 or $length_array==2
    if (count($length_array) > 2) 
    {
        // count mismatch
        return 'Invalid CSV!';
    }

Avatar
  • 14,622
  • 9
  • 119
  • 198