I have been having an issue for a little bit of time now and cannot find a solution that has worked for me. It might be that I am just not doing it correctly or that there is an alternative that will work better. I am opened to, and appreciative, of any ideas.
I have a table (tblDocQueue) in access that is like the one displayed below. The data comes from a data extract of an older application that we use at work, so the source or the extract cant be changed. We upload the data to Access to analyze and build metrics around it. The tables is as follows:
ID DocName OwnerName AccountNum DocRef
1 Doc 1 Matt 1001 Z0005638
2 Doc 1 Matt 1002 Z0005638
3 Doc 1 Tony 5010 Z0005639
4 Doc 2 Luke 1050 Z0005640
5 Doc 3 Luke 1050 Z0005641
6 Doc 3 Gary 1234 Z0005641
7 Doc 4 John 8789 Z0005642
8 Doc 5 Ed 8789 Z0005642
9 Doc 5 Ed 8790 Z0005643
10 Doc 5 Connie 4579 Z0005644
11 Doc 6 Mary 3616 Z0005645
12 Doc 6 Lucy 4795 Z0005646
13 Doc 6 Tina 4795 Z0005646
14 Doc 7 Matt 1001 Z0005638
15 Doc 7 John 8789 Z0005647
There are more columns than what are listed, but they are not relevant to the question. I am trying to remove duplicates, keeping one unique value, based on three columns (DocName, OwnerName, Doc Ref). I was doing this using the following SQL, but it began taking hours (up to 7) to process around 500,000 lines of data. I am unsure if the efficiency problem is because of using min/max or some other reason
SELECT tblDocQueue.ID AS Expr1, tblDocQueue.DocName AS Expr2,
tblDocQueue.OwnerName AS Expr3, tblDocQueue.AcctNo AS Expr4,
tblDocQueue.ExpDate AS Expr5, tblDocQueue.EffectiveDate AS Expr6,
tblDocQueue.SignatureDate AS Expr7, tblDocQueue.DocBNYSts AS Expr8,
tblDocQueue.StsDate AS Expr9, tblDocQueue.UserSts AS Expr10,
tblDocQueue.Location AS Expr11, tblDocQueue.Ackngmt AS Expr12,
tblDocQueue.OPID AS Expr13, tblDocQueue.Comments AS Expr14,
tblDocQueue.DocRef AS Expr15, tblDocQueue.ExternalComment AS Expr16,
tblDocQueue.FirstName AS Expr17, tblDocQueue.LastName AS Expr18,
tblDocQueue.ClientID AS Expr19, tblDocQueue.Address AS Expr20,
tblDocQueue.CountryCode AS Expr21
FROM tblDocQueue
WHERE ((([tblDocQueue].[ID])=(
SELECT Min(t.[ID])
FROM [tblDocQueue] AS t
WHERE t.[DocRef]=[tblDocQueue].[DocRef]
AND t.[DocName]=[tblDocQueue].[DocName])));
The time that this it was taking was unacceptable for the business team. I then developed a work around in VBA that exports the data to an excel file, using the remove duplicates function preloaded in excel, and imported the unique data back into a different table. This only takes a few seconds in excel. As the use of this database is beginning to expand, and I will removing duplicates in a similar way for hundreds of datasets a day, I am trying to get this to work without using the above workaround.
The anticipated result of the above example data would be:
ID DocName OwnerName AccountNum DocRef
1 Doc 1 Matt 1001 Z0005638
3 Doc 1 Tony 5010 Z0005639
4 Doc 2 Luke 1050 Z0005640
5 Doc 3 Luke 1050 Z0005641
6 Doc 3 Gary 1234 Z0005641
7 Doc 4 John 8789 Z0005642
8 Doc 5 Ed 8789 Z0005642
9 Doc 5 Ed 8790 Z0005643
10 Doc 5 Connie 4579 Z0005644
11 Doc 6 Mary 3616 Z0005645
12 Doc 6 Lucy 4795 Z0005646
13 Doc 6 Tina 4795 Z0005646
14 Doc 7 Matt 1001 Z0005638
15 Doc 7 John 8789 Z0005647
If anyone can help me with the SQL to:
- Conditionally remove duplicate values based on three columns
- Using Microsoft Access 2010
- While keeping one row of each unique value
- In some way that is efficient / does not take a long amount of time (tables up to 5,000,000 records)
Any and all help is greatly appreciated!