1

I have a CSV file that looks like this:

account, name, email,
123, John, dsfs@email.com
123, John, dsfs@email.com
1234, Alex, ala@email.com

I need to remove duplicate rows. I tried to do it like this:

$inputHandle = fopen($inputfile, "r");
$csv = fgetcsv($inputHandle, 1000, ",");

$accounts_unique = array();

$accounts_unique = array_unique($csv);  

print("<pre>".print_r($accounts_unique, true)."</pre>");

But print_r shows only the first (header) row. What do I need to do to 1. remove the duplicate rows from the CSV file and 2. make a list of those duplicates (maybe store them in another CSV)?

Alex

3 Answers

4

A simple solution, but it requires a lot of memory if the file is really big, because the whole file is loaded into an array.

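// FILE_IGNORE_NEW_LINES strips the line endings so implode() does not double them
$lines = file('csv.csv', FILE_IGNORE_NEW_LINES);
$lines = array_unique($lines);
// file_put_contents() needs a target filename; here the result overwrites the input file
file_put_contents('csv.csv', implode(PHP_EOL, $lines) . PHP_EOL);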
sectus
  • Hmm, I think I need some more logic there...How can I make note of duplicate rows? – Alex Jul 01 '13 at 14:10
  • and btw, the duplicates are not removed when I run this – Alex Jul 01 '13 at 14:12
  • @sectus -- just suggesting that you might want to use `array_keys(array_flip())` or `array_flip(array_flip())` rather than `array_unique()`, given the significant performance difference. @Alex -- `array_diff_key($before, $after)` will give you the dropped item keys if you used `array_unique()` or `array_flip(array_flip())`. – Jacob S Jul 01 '13 at 18:06
  • @Alex, sorry, changed answer (added `$lines = `) – sectus Jul 02 '13 at 00:10
1

I would go this route, which will be faster than array_unique:

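// fgetcsv() returns only one row (as an array), so read every line of the file instead
$csv = array_map('trim', file($inputfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
$data = array_flip(array_flip($csv)); // removes exact duplicates (the last occurrence of each line is kept)
$dropped = array_diff_key($csv, $data); // the removed lines, keyed by their original line index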

Note -- array_unique() and array_flip(array_flip()) only treat lines as duplicates when they are exactly the same string, so rows that differ in spacing or quoting will not match.
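To illustrate the exact-match caveat, here is a minimal sketch that normalizes each row before comparing. The sample rows and the `$normalize` closure are made up for illustration; it assumes that trimming whitespace around each field is the only normalization you need.

```php
<?php
// Hypothetical sample rows modelled on the question's file (not read from disk).
$lines = [
    'account, name, email',
    '123, John, dsfs@email.com',
    '123,John,dsfs@email.com',   // same record, different spacing
    '1234, Alex, ala@email.com',
];

// Exact string comparison treats the two John rows as different.
$exact = array_unique($lines);

// Parse each line into fields, trim them, and rebuild a canonical string.
$normalize = function (string $line): string {
    $fields = array_map('trim', str_getcsv($line, ',', '"', '\\'));
    return implode(',', $fields);
};
$normalized = array_map($normalize, $lines);

$unique  = array_unique($normalized);             // keeps the first occurrence of each row
$dropped = array_diff_key($normalized, $unique);  // rows that were removed as duplicates

print_r($unique);
print_r($dropped);
```

`$dropped` keeps the original line indexes as keys, so you can report which lines of the file were duplicates.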

Updated to include information from my comments.

Jacob S
1

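If you are going to loop over the data from the CSV anyway, I think it would be best to do something like this.

$dataset = array();
$lines = file($inputfile, FILE_IGNORE_NEW_LINES); // one string per row
foreach ($lines as $line) {
    // identical lines hash to the same key, so a duplicate overwrites the earlier copy
    $dataset[sha1($line)] = $line;
}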
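The same hash-key idea can also cover the second half of the question, keeping a list of the duplicates. Below is a rough sketch that reads the file row by row, so only the hashes stay in memory, and writes unique rows and duplicate rows to two separate CSV files. The function name `splitDuplicates` and its parameters are hypothetical, and it assumes two rows count as duplicates when all of their trimmed fields are equal.

```php
<?php
/**
 * Hypothetical helper: copies unique rows from $in to $out and
 * repeated rows to $dupes. Returns the number of duplicate rows found.
 */
function splitDuplicates(string $in, string $out, string $dupes): int
{
    $src  = fopen($in, 'r');
    $uniq = fopen($out, 'w');
    $dup  = fopen($dupes, 'w');
    if ($src === false || $uniq === false || $dup === false) {
        throw new RuntimeException('Could not open one of the files');
    }

    $seen  = [];  // hash of every row written so far
    $count = 0;
    while (($row = fgetcsv($src, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as an array holding one null
        }
        $row = array_map('trim', $row);
        $key = sha1(implode("\0", $row)); // "\0" separator avoids accidental collisions between fields
        if (isset($seen[$key])) {
            fputcsv($dup, $row, ',', '"', '\\');   // already seen: log it as a duplicate
            $count++;
        } else {
            $seen[$key] = true;
            fputcsv($uniq, $row, ',', '"', '\\');  // first occurrence: keep it
        }
    }

    fclose($src);
    fclose($uniq);
    fclose($dup);
    return $count;
}
```

A call such as `splitDuplicates($inputfile, 'clean.csv', 'duplicates.csv')` (both output names are placeholders) would leave the input untouched and produce the cleaned file plus the list of duplicates.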