0

I am reading a CSV file but some of the values are not escaped so PHP is reading it wrong. Here is an example of a line that is bad:

" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso "Harvest Time, Somerset" signed and dated '87, framed, 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.","40","60","WAT","Paintings, prints and watercolours",

You can see Harvest Time, Somerset has quotes around it, causing PHP to think its a new value.

When i do print_r() on each line, the broken lines end up looking like this:

Array
(
    [0] =>  635
    [1] =>  
    [2] => AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time
    [3] => Somerset" signed and dated '87
    [4] => framed
    [5] => 69cm by 49cm. (2)  NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art."
    [6] => 40
    [7] => 60
    [8] => WAT
    [9] => Paintings, prints and watercolours
    [10] => 
)

Which is obviously wrong, as it now contains many more array elements than other correct rows.

Here is the PHP i am using:

$i = 1;
if (($file = fopen($this->request->data['file']['tmp_name'], "r")) !== FALSE) {
    while (($row = fgetcsv($file, 0, ',', '"')) !== FALSE) {
        if ($i == 1){
            $header = $row;
        }else{
            if (count($header) == count($row)){
                $lots[] = array_combine($header, $row);
            }else{
                $error_rows[] = $row;
            }

        }
        $i++;
    }
    fclose($file);
}

Rows with the wrong amount of values get put into $error_rows and the rest get put into a big $lots array.

What can I do to get around this? Thanks.

472084
  • 17,666
  • 10
  • 63
  • 81
  • 1
    Should post your code on how you parse the CSV, There maybe mistakes there, but can't help or tell you that without your code. – Churk Mar 16 '12 at 11:42

5 Answers5

1

If you know that you'll always get entries 0 and 1, and that the last 5 entries in the array are always correct, so it's just the descriptive entry that's "corrupted" because of unescaped enclosure characters, then you could extract the first 2 and last 5 using array_slice(), implode() the remainder back into a single string (restoring the lost quotes), and rebuild the array correctly.

$testData = '" 635"," ","AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso "Harvest Time, Somerset" signed and dated \'87, framed, 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.","40","60","WAT","Paintings, prints and watercolours",';

$result = str_getcsv($testData, ',', '"');

$hdr = array_slice($result,0,2);
$bdy = array_slice($result,2,-5);
$bdy = trim(implode('"',$bdy),'"');
$ftr = array_slice($result,-5);

$fixedResult = array_merge($hdr,array($bdy),$ftr);
var_dump($fixedResult);

result is:

array
  0 => string ' 635' (length=4)
  1 => string ' ' (length=1)
  2 => string 'AUBREY R. PHILLIPS (1920- ) - Pastel depicting cottages in a steep sided river valley, possibly North Wales, signed and dated 2000, framed, 66cm by 48cm. another of a rural landscape, titled verso Harvest Time" Somerset" signed and dated '87" framed" 69cm by 49cm. (2) NB - Aubrey Phillips is a Worcestershire artist who studied at the Stourbridge School of Art.' (length=362)
  3 => string '40' (length=2)
  4 => string '60' (length=2)
  5 => string 'WAT' (length=3)
  6 => string 'Paintings, prints and watercolours' (length=34)
  7 => string '' (length=0)

Not perfect, but possibly good enough

The alternative is to get whoever is generating the csv to properly escape their enclosures

Mark Baker
  • 209,507
  • 32
  • 346
  • 385
  • This does seem a bit hacky, but if its the only way. Hopeful I wont get "corrupted" data in other fields lol! One small issue, yours contains `titled verso Harvest Time" Somerset" signed and dated` when the original is `titled verso "Harvest Time, Somerset" signed and dated` – 472084 Mar 16 '12 at 12:05
1

If you can ecape the " in the text like this: \"

and the in fgetcsv use specify th escape char

fgetcsv($file, 0, ',', '"','\');
ab_dev86
  • 1,952
  • 16
  • 21
0
$csv = explode(' ', $csv);
foreach ($csv as $k => $v) if($v[0] == '"' && substr($v, -1) == '"') {
    $csv[$k] = mb_convert_encoding('“' . substr($v, 1, -1) . '”', 'UTF-8', 'HTML-ENTITIES');
}
$csv = implode(' ', $csv);
$csv = str_getcsv($csv);
0

This is a long shot so don't take i to seriously.

I saw a pattern in the text that all the ',' you want to ignore has a space after it. Search and replace ', ' with 'FUU' or something unique.

Now parse the csv file. It might get the correct format. You only need to replace 'FUU' back to ', '

:)

heldt
  • 4,166
  • 7
  • 39
  • 67
0

You are probably reading the contents of the CSV file as an array of lines, then splitting each line on the comma. This fails since some of the fields also contain commas. One trick that could help you out is to look for ",", which would indicate a field separator which would be unlikely (but not impossible, unfortunately) to occur inside a field.

<?php
  $csv = file_get_contents("yourfile.csv");
  $lines = split("\r\n", $csv);
  echo "<pre>";
  foreach($lines as $line)
  {
    $line = str_replace("\",\"", "\"@@@\"", $line);
    $fields = split("@@@", $line);
    print_r($fields);
  }
  echo "</pre>";
?>
Alexander van Oostenrijk
  • 4,644
  • 3
  • 23
  • 37