De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
2
votes
1 answer
Python list of tuples deduplication
I am trying to deduplicate a set of different lists of tuples, one after another. The lists look like:
A = [
(('X','Y','Z',2,3,4), ('A','B','C',5,10,11)),
(('A','B','C',5,10,11), ('X','Y','Z',2,3,4)),
(('T','F','J',0,1,0),…
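A minimal sketch of one way to approach this, assuming the goal is to treat a pair as a duplicate when it contains the same two inner tuples in either order (the excerpt is truncated, so that reading is an assumption). The helper name `dedupe_pairs` is hypothetical.

```python
def dedupe_pairs(pairs):
    """Keep the first occurrence of each pair, ignoring order within the pair."""
    seen = set()
    unique = []
    for pair in pairs:
        # frozenset gives an order-insensitive key; the inner tuples are hashable
        key = frozenset(pair)
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


A = [
    (('X', 'Y', 'Z', 2, 3, 4), ('A', 'B', 'C', 5, 10, 11)),
    (('A', 'B', 'C', 5, 10, 11), ('X', 'Y', 'Z', 2, 3, 4)),
]
print(dedupe_pairs(A))
```

If order within a pair matters, using `pair` itself as the key instead of `frozenset(pair)` keeps only exact repeats out.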

user2658190
2
votes
3 answers
Remove all duplicates from MySQL table
I have a table which shows product IDs and how many times they have been given 1 star, 2 stars, 3 stars, 4 stars and 5 stars when reviewed by customers, along with the average rating for that product. There are some duplicate rows appearing in this…

Ben Paton
2
votes
2 answers
Google Script to remove duplicate rows based on 2 columns criteria
I am using a script that pulls event details from a calendar and adds them into columns A and B in a spreadsheet, removes any duplicate events and then sorts based on date. My hope is then that I can have staff add additional data about these…

DMarx
2
votes
2 answers
Deduplication / matching in CouchDB?
I have documents in CouchDB. The schema looks like below:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person if any one of
email,
personal_blog_url, or
telephone
is identical.
I have 3 views created,…
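The matching rule above is transitive in practice (A matches B by email, B matches C by telephone), so one way to sketch it is with a union-find structure. This is an in-memory Python illustration under that assumption, not a CouchDB view; the function name `group_same_person` and the sample documents are hypothetical, while the field names come from the schema above.

```python
def group_same_person(docs, keys=("email", "personal_blog_url", "telephone")):
    """Group userIds whose documents share any identifying field value."""
    parent = {doc["userId"]: doc["userId"] for doc in docs}

    def find(x):
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, value) -> first userId seen with that value
    for doc in docs:
        for key in keys:
            value = doc.get(key)
            if not value:
                continue  # missing or empty fields never match
            owner = first_seen.setdefault((key, value), doc["userId"])
            union(owner, doc["userId"])

    groups = {}
    for doc in docs:
        groups.setdefault(find(doc["userId"]), []).append(doc["userId"])
    return sorted(sorted(group) for group in groups.values())


docs = [
    {"userId": "u1", "email": "a@example.com"},
    {"userId": "u2", "email": "a@example.com", "telephone": "111"},
    {"userId": "u3", "telephone": "111"},
    {"userId": "u4", "email": "b@example.com"},
]
print(group_same_person(docs))
```

Here u1 and u3 share no field directly but end up in the same group through u2.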

greeness
2
votes
1 answer
Removing tables with a join in SQL Server
I'm new to this DBA thing and I've been tasked with removing duplicates from a couple of tables. I'm working in SQL Server. They all have a field called LAST_UPD that tracks their last update. All the tables join to TABLE1 and each user is…

Fear605
2
votes
3 answers
Detecting a duplicate customer
I have a bunch of customer data that is normalized into multiple tables. I want to decide the best criteria for making a best guess that two customers might be the same. There needs to be a balance between minimizing the number of duplicates but also…

Christopher Martin
1
vote
4 answers
Best way or algorithm to near duplicate check against huge list of files?
I am using Python to near-duplicate check a huge list of files (over 20,000 files, totaling about 300 MB).
The current approach does near-duplicate checking with difflib's SequenceMatcher and gets the result using quick_ratio().
With 4 worker process it takes 25 hours to get…
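As a rough illustration of reducing the cost per comparison with the same library, the sketch below cascades `SequenceMatcher`'s three ratio methods from cheapest to most expensive. `real_quick_ratio()` and `quick_ratio()` are documented upper bounds on `ratio()`, so a pair that fails either bound can be rejected early. The function name `is_near_duplicate` and the 0.9 threshold are assumptions, not values from the question.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a, b, threshold=0.9):
    """Return True when the similarity ratio of a and b reaches the threshold."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Cheapest bound first: depends only on the two lengths
    if matcher.real_quick_ratio() < threshold:
        return False
    # Next bound: compares character counts, ignores ordering
    if matcher.quick_ratio() < threshold:
        return False
    # Full comparison only for pairs that survive both bounds
    return matcher.ratio() >= threshold


print(is_near_duplicate("hello world", "hello world"))
print(is_near_duplicate("abc", "xyz"))
```

This does not reduce the number of pairs compared, which still grows quadratically with the number of files; it only makes most rejections cheaper.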

Phyo Arkar Lwin
1
vote
4 answers
Dedupe records without DELETE
I need to bring back only one of the records from a duplicated row in SQL Server
I have data like this
-------------------------------------------
CustomerID, OrderID, ProductID,…

Sandeep Bansal
1
vote
1 answer
Neatest way to get a distinct list of phone numbers (without removing original formatting)?
We have a master Person record and one (or more) duplicate Persons and we are merging their data, prioritising the master over the duplicate(s).
When it comes to phone numbers the goal is to merge their data, with a single phone number going into…
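One possible sketch, assuming two phone numbers count as the same when their digits match: compare on a digits-only key but keep the first original string seen. Passing the master's numbers first gives them priority. The function name `distinct_phone_numbers` and the sample numbers are hypothetical, and country-code prefixes are not handled.

```python
import re


def distinct_phone_numbers(numbers):
    """Return numbers in order, dropping later ones with the same digits."""
    seen = set()
    result = []
    for number in numbers:
        key = re.sub(r"\D", "", number)  # digits only, used for comparison
        if key and key not in seen:
            seen.add(key)
            result.append(number)  # original formatting is preserved
    return result


master = ["(555) 123-4567"]
duplicate = ["555-123-4567", "555 987 6543"]
print(distinct_phone_numbers(master + duplicate))
```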

hawbsl
1
vote
1 answer
Deduplication Suggestions for Email Storage
The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does…

700 Software
1
vote
1 answer
How can I make doctrine not persist duplicate objects in my database?
I have two different kinds of objects: Ride and Location.
A Ride has an origin and a destination which are Location objects.
Location does not point back to the Ride.
This means I have a many-to-one uni-directional relationship in doctrine.
How can…

Patrick James McDougle
1
vote
3 answers
MySQL distinct query returns rows with duplicate information, need deduplication
I have a table similar to the one shown below in a MySQL database:
+----------+----------+----------+----------+----------+
| Column_A | Column_B | Column_C | Column_D | Column_E |
+----------+----------+----------+----------+----------+ …

Prasad
1
vote
2 answers
Building A Deduplication Application For OS X, What/How Should I Use As The Hash For Files
I am about to embark on a programming journey, which undoubtedly will end in failure and/or throwing my mouse through my Mac, but it's an interesting problem.
I want to build an app, which scans starting at some base directory and recursively loops…
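A minimal sketch of one common approach to this kind of scan, written in Python for illustration rather than as an OS X application: group files by size first, then hash only the files whose size collides. SHA-256 is an assumption here (the question leaves the hash choice open), and the names `file_digest` and `find_duplicate_files` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(base_dir):
    """Return lists of paths under base_dir that have identical content hashes."""
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_hash[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

The size pre-filter avoids reading most files at all, which usually matters more for run time than the choice of hash function.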

Justin
1
vote
3 answers
SQL Server 2008 De-duping
Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:
supply_req_id | int | [primary key]
supply_req_dt | datetime |
request_id | int | [foreign key]
supply_id …

Andy Evans
1
vote
1 answer
Advice and tools to help normalize a database
I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes…

neo2862