How to remove all duplicate lines from a list

Question

I have a text file (or Excel) with 200.000 email addresses (one per line). I need to remove all addresses that is duplicate including the original So this example:

a@a.com

b@a.com

c@a.com

d@a.com

should remove both the b@a.com entries. anyone know a way to do this without looping 200.000 lines

So you want to remove BOTH duplicate records and not keep any that occur more than once? — rwking, Nov 07 '13 at 15:02
Yes. both should be removed. I do not have MS Access. If anyone have some vba code I guess I can use that in excel. but was hoping for a free editor that can do it. — Thomas Adrian, Nov 07 '13 at 15:06

score 2 · Answer 1 · answered Nov 07 '13 at 15:09

2

Put the data in column A starting at A2

In B2 enter:

=COUNTIF(A:A,A2) and copy down

Turn on AutoFilter and for column B, un-check the 1 box (hides the unique items)

Delete the visible rows.

answered Nov 07 '13 at 15:09

Gary's Student

95,722
10
59
99

Beat you by 6 seconds! – mattboy Nov 07 '13 at 15:10

score 2 · Answer 2 · edited May 23 '17 at 12:29

Using the previous post as an example, I have modified the statement for you, it should create a new sheet with the data you want when you run the code. You will need to ensure that the ConnString line points to your file and change the word emailcolumn to be the name of the column containing your email addresses.

I have assumed that the sheet containing the email addresses is called Sheet1 and the data will go to column A of the new sheet (change the code if needs be):

Sub Excel_QueryTable()

Sheet2.Cells.ClearContents

Dim oCn As ADODB.Connection
Dim oRS As ADODB.Recordset
Dim ConnString As String
Dim SQL As String

Dim qt As QueryTable

ConnString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\t.xlsm;Extended Properties=Excel 8.0;Persist Security Info=False"
Set oCn = New ADODB.Connection
oCn.ConnectionString = ConnString
oCn.Open

SQL = "Select emailcolumn from [Sheet1$] GROUP BY emailcolumn HAVING COUNT(emailcolumn) = 1"

Set oRS = New ADODB.Recordset
oRS.Source = SQL
oRS.ActiveConnection = oCn
oRS.Open

Set qt = Worksheets(2).QueryTables.Add(Connection:=oRS, _
Destination:=Range("A1"))

qt.Refresh

If oRS.State <> adStateClosed Then
oRS.Close
End If

If Not oRS Is Nothing Then Set oRS = Nothing
If Not oCn Is Nothing Then Set oCn = Nothing

End Sub

The nice thing about this is that you can use any valid ANSI SQL statement to do what you need, a bit overkill in this case perhaps but still worth mentioning. Plus you aren't killing your original data (in case you need to query something later) — twoleggedhorse, Nov 07 '13 at 15:14

score 1 · Answer 3 · answered Nov 07 '13 at 15:09

1

How about using COUNTIF to find the duplicates, then filtering them all out and deleting them way?

i.e. Assuming your emails are in column A, put this formula in cell B1 and copy it down to the bottom.

=COUNTIF(A:A; A1)>1

This formula would return TRUE if there's a duplicate. Filter out all the TRUEs and delete those rows!

answered Nov 07 '13 at 15:09

mattboy

2,870
5
26
40

@zx8754 : Would COUNTIFS() be faster?? – Gary's Student Nov 07 '13 at 15:17
@Gary'sStudent I don't think so, but someone with better knowledge of internals of Excel could comment. – zx8754 Nov 07 '13 at 15:22
@zx8754, @Gary'sStudent I got curious and tested it out with 220 000 rows. It took about 10-15 seconds with both `countif` and `countifs`. The `OR` solution was much faster, but that required sorting all the rows first which also took a while. – mattboy Nov 07 '13 at 15:40

score 1 · Answer 4 · answered Nov 07 '13 at 15:10

1

No need for VBA, simply sort the column and check if the cell has the same value as above cell or below, then use filter for TRUE. Something like this:

=OR(A2=A1,A2=A3)

answered Nov 07 '13 at 15:10

zx8754

52,746
12
114
209

How to remove all duplicate lines from a list

4 Answers4