de-duplicate a list of strings

Question

I very commonly run into this issue:

I have a csv file with a list of data in it
I need to remove duplicates (or sometimes, find the values that are duplicated)

The csv is easy to bring into excel, but I can't find (or, never remember) a good method to find/remove/count duplicate values.

I can export the data to a database and run some simple sql queries to do all this, but then the database gets in the way with most other operations and I wind up exporting the data back out to excel to do cell level work.

Is there not some tool that can make working with tabular data as easy as excel, but include the more powerful features of a sql query language?

I've also found this tool online which works with simplifies lists of duplicates inside the browser, but of course it's online and still not the best option: http://www.amazify.com/find-duplicates-in-a-list-of-strings — Brady Moritz, Jul 24 '14 at 22:31
I'm "describing the problem and what I've done so far to solve it" and I'm not happy with my solutions so far. thanks — Brady Moritz, Jul 24 '14 at 22:38

brettdj · Answer 1 · 2014-07-25T01:15:47.783

Use PowerShell

Directly, or less efficiently via VBA

Update: I note this assumes you have two columns in your csv with headers of First andLast - this should be clear in the examples below as to where you need to update it

vba

 Sub Better()
 X = Shell("powershell.exe  Import-Csv C:\temp\test.csv | sort First,Last  -unique | Export-Csv  C:\temp\test2.csv -NoTypeInformation", 1)
 End Sub

powershell

Import-Csv C:\temp\test.csv | sort First,Last  -unique | Export-Csv  C:\temp\test2.csv -NoTypeInformation

score 1 · Answer 2 · edited Jul 26 '14 at 10:27

1

Two options:

Option 1

Data tab
Click: Remove duplicates
Choose the column(s) you want to remove duplicates from

Option 2

Use the following formula in Excel: =COUNTIF($A$2:$A$100,A2)

This formula assumes the values to check are in column A, start on row 2, and extend to row 100. Adjust as needed.

You can drag this formula down to the bottom of your data set and then filter for values greater than 1. These will show duplicate values, and you can delete them at your choosing.

edited Jul 26 '14 at 10:27

sina72

4,931
3
35
36

answered Jul 25 '14 at 00:09

Don S

231
2
9

will option 1 work with removing "rows" of data wider than the single column? – Brady Moritz Oct 06 '14 at 00:02

score 1 · Answer 3 · edited Jul 26 '14 at 10:24

To add another method, you can use Pivot table for this purpose.

Create a pivot table with the record, drag the name of the column that you want unique records into the "Row Labels" field of the pivot table.

if you like keyboard shortcut:

Click anywhere in the csv data region
Keyboard entries in the order: Alt+D, Alt+P, Alt+F
Drag the column name of the intended column into Row Labels.

score 0 · Answer 4 · answered Jul 24 '14 at 22:51

In Excel, you can do the following:

Go to the Data tab
Choose Filter-->Advanced Filter
Check Unique records only

Then you can copy the list somewhere else by highlighting it, copying, and pasting.

Moving the data to a database to do this is overkill, unless there is additional processing you want done.

panofish · Answer 5 · 2014-07-25T19:29:17.160

In my opinion, nothing is better than using MySQL data directly... and the closest thing to Excel cell editing for MySQL is to use the free Heidisql.

It is an amazing application and I use it all the time. It is the best thing for MySQL, because you can take full advantage of queries and excellent MySQL database management. Much more flexible and fast than alternatives like phpmyadmin. To improve some remote host connection speeds... just be sure to uncheck "Get full table status" under the advanced tab in the session manager.

And if you need a fast and easy way to get your Excel data into MySQL, I wrote a program that is free to try that will transfer the Excel data into MySQL called Excel2MySQL.

de-duplicate a list of strings

5 Answers5