Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
2 votes, 1 answer

Python list of tuples deduplication

I am trying to deduplicate a set of different lists of tuples one after another. The lists look like: A = [ (('X','Y','Z',2,3,4), ('A','B','C',5,10,11)), (('A','B','C',5,10,11), ('X','Y','Z',2,3,4)), (('T','F','J',0,1,0),…
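
One possible reading of the excerpt above is that a pair and its reversed pair (such as the first two entries of A) count as duplicates. A minimal sketch under that assumption follows; the function name `dedupe_pairs` is hypothetical, and only the two fully visible entries of A are used, since the rest of the list is truncated.

```python
def dedupe_pairs(pairs):
    """Keep the first occurrence of each pair, ignoring order inside the pair."""
    seen = set()
    result = []
    for pair in pairs:
        # Assumption: (a, b) and (b, a) are the same record, so use an
        # order-insensitive, hashable key.
        key = frozenset(pair)
        if key not in seen:
            seen.add(key)
            result.append(pair)
    return result


# Only the two entries that are fully visible in the excerpt.
A = [
    (('X', 'Y', 'Z', 2, 3, 4), ('A', 'B', 'C', 5, 10, 11)),
    (('A', 'B', 'C', 5, 10, 11), ('X', 'Y', 'Z', 2, 3, 4)),
]
print(dedupe_pairs(A))
```

If order inside a pair does matter, the plain tuple can be used as the key instead of the `frozenset`.
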
2 votes, 3 answers

Remove all duplicates from mysql table

I have a table which shows product IDs and how many times they have been given 1 star, 2 stars, 3 stars, 4 stars and 5 stars when reviewed by customers, along with the average rating for that product. There are some duplicate rows appearing in this…
Ben Paton
2 votes, 2 answers

Google Script to remove duplicate rows based on 2 columns criteria

I am using a script that pulls event details from a calendar and adds them into columns A and B in a spreadsheet, removes any duplicate events and then sorts based on date. My hope is then that I can have staff add additional data about these…
DMarx
2 votes, 2 answers

Deduplication / matching in CouchDB?

I have documents in CouchDB. The schema looks like this: userId, email, personal_blog_url, telephone. I assume two users are actually the same person as long as their email, personal_blog_url or telephone is identical. I have 3 views created,…
greeness
2 votes, 1 answer

Removing tables with a join in SQL Server

I'm new to this DBA thing and I've been tasked with removing duplicates from a couple of tables. I'm working in SQL Server. They all have a field called LAST_UPD that tracks their last update. All the tables join to TABLE1 and each user is…
2 votes, 3 answers

Detecting a duplicate customer

I have a bunch of customer data that is normalized into multiple tables. I want to decide the best criteria for making a best guess that a customer might be the same. There needs to be a balance between minimizing the number of duplicates but also…
Christopher Martin
1 vote, 4 answers

Best way or algorithm to near duplicate check against huge list of files?

I am using Python to near-dupe check a huge list of files (over 20000), totaling about 300 MB. The current way is to do near-dupe checking using difflib's SequenceMatcher and getting the result using quick_ratio. With 4 worker processes it takes 25 hours to get…
Phyo Arkar Lwin
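
For readers unfamiliar with the approach this excerpt describes, here is a minimal sketch of pairwise near-duplicate checking with `difflib.SequenceMatcher`. It relies on the documented property that `real_quick_ratio()` and `quick_ratio()` are upper bounds on `ratio()`, so the cheaper checks can reject a pair early. The function name `near_duplicates`, the 0.9 threshold and the sample texts are assumptions, and the comparison is still quadratic in the number of files, so this illustrates the technique rather than a fix for the running time mentioned in the question.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(texts, threshold=0.9):
    """Return name pairs whose contents have a similarity ratio >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(texts.items()), 2):
        matcher = SequenceMatcher(None, a, b)
        # Cheapest check first: both quick variants are upper bounds on
        # ratio(), so a pair failing either one cannot reach the threshold.
        if (matcher.real_quick_ratio() >= threshold
                and matcher.quick_ratio() >= threshold
                and matcher.ratio() >= threshold):
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical in-memory stand-ins for file contents.
texts = {
    "a.txt": "the quick brown fox jumps over the lazy dog",
    "b.txt": "the quick brown fox jumps over the lazy dog!",
    "c.txt": "completely different content here",
}
print(near_duplicates(texts))
```
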
1 vote, 4 answers

Dedupe records without DELETE

I need to bring back only one of the records from a duplicated row in SQL Server. I have data like this: ------------------------------------------- CustomerID, OrderID, ProductID,…
Sandeep Bansal
1 vote, 1 answer

Neatest way to get a distinct list of phone numbers (without removing original formatting)?

We have a master Person record and one (or more) duplicate Persons and we are merging their data, prioritising the master over the duplicate(s). When it comes to phone numbers the goal is to merge their data, with a single phone number going into…
hawbsl
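
One way to picture the goal in this excerpt is to compare phone numbers on a normalized key while keeping the original string of the first occurrence. The sketch below assumes that two numbers are the same when their digits match, and that the master's numbers are listed before the duplicates' so they take priority; the function name `distinct_phone_numbers` and the sample numbers are hypothetical. It does not handle country-code prefixes or extensions.

```python
import re


def distinct_phone_numbers(numbers):
    """Return numbers that differ by digits, keeping each original formatting."""
    seen = set()
    result = []
    for number in numbers:
        key = re.sub(r"\D", "", number)  # digits only, used for comparison
        if key and key not in seen:
            seen.add(key)
            result.append(number)  # keep the string exactly as entered
    return result


# Master's numbers first, then the duplicate's (hypothetical data).
merged = distinct_phone_numbers(["(020) 7946 0000", "020-7946-0000", "07700 900123"])
print(merged)
```
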
1 vote, 1 answer

Deduplication Suggestions for Email Storage

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does…
700 Software
1 vote, 1 answer

How can I make Doctrine not persist duplicate objects in my database?

I have two different kinds of objects: Ride and Location. A Ride has an origin and a destination which are Location objects. Location does not point back to the Ride. This means I have a many-to-one uni-directional relationship in Doctrine. How can…
1 vote, 3 answers

MySQL distinct query returns rows with duplicate information, need deduplication

I have a table similar to the one shown below in a MySQL database: +----------+----------+----------+----------+----------+ | Column_A | Column_B | Column_C | Column_D | Column_E | +----------+----------+----------+----------+----------+ …
Prasad
1 vote, 2 answers

Building A Deduplication Application For OS X, What/How Should I Use As The Hash For Files

I am about to embark on a programming journey, which undoubtedly will end in failure and/or throwing my mouse through my Mac, but it's an interesting problem. I want to build an app, which scans starting at some base directory and recursively loops…
Justin
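
As a rough illustration of the kind of scan this excerpt describes, the sketch below walks a base directory, groups files by size first (files of different sizes cannot be identical), and only hashes the files that share a size. SHA-256 via `hashlib` is an assumption here, not something the question specifies, and the function names `file_digest` and `find_duplicate_files` are hypothetical. Nothing in it is specific to OS X.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(base_dir):
    """Return sorted groups of paths under base_dir with identical content hashes."""
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size means the file cannot have a duplicate
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)
```

Calling `find_duplicate_files(some_directory)` returns a list of groups, each group being the paths that hashed to the same digest.
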
1 vote, 3 answers

SQL Server 2008 De-duping

Long story short, I took over a project and a table in the database is in serious need of de-duping. The table looks like this:
supply_req_id | int | [primary key]
supply_req_dt | datetime |
request_id | int | [foreign key]
supply_id …
Andy Evans
1 vote, 1 answer

Advice and tools to help normalize a database

I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes…
neo2862