De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
1
vote
1 answer
Keep one entry of duplicate articles with SOLR deduplication
I have used Solr deduplication with the following setting in solrconfig.xml
true

John Smith
1
vote
2 answers
SQL: how to select the row with most known values?
I have a table of users (username, gender, date_of_birth, zip) where the user's id is permanent, but a user could have registered many times in the past, sometimes filling out all the data and sometimes not. Besides that, he could change…

Niko Gamulin
0
votes
2 answers
Deduplication mysql result using PHP
I have a table with entries such as:
123 (DVD)
123 [DVD] [2007]
125 [2009]
189 (CD)
When I present these to the user in an autocomplete field, I do away with anything between either () or [], as these are not relevant. But, as you can see from the…
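The stripping step described in this question can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP the asker uses; the function names `strip_qualifiers` and `dedupe_entries` are hypothetical, and it assumes the bracketed parts never nest.

```python
import re

# One parenthesised or square-bracketed qualifier, plus any whitespace before it.
# Assumption: qualifiers are not nested inside each other.
QUALIFIER = re.compile(r"\s*(\([^)]*\)|\[[^\]]*\])")


def strip_qualifiers(entry: str) -> str:
    """Remove every (...) and [...] group from an entry."""
    return QUALIFIER.sub("", entry).strip()


def dedupe_entries(entries):
    """Strip qualifiers, then keep the first occurrence of each result, in order."""
    seen = set()
    unique = []
    for entry in entries:
        key = strip_qualifiers(entry)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The four sample entries quoted in the question.
entries = ["123 (DVD)", "123 [DVD] [2007]", "125 [2009]", "189 (CD)"]
print(dedupe_entries(entries))  # ['123', '125', '189']
```

Deduplicating on the stripped key rather than the raw entry is what collapses the two "123" rows into one suggestion.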

StudioTime
0
votes
1 answer
A Deduplication DB schema with sqlalchemy - How to represent a group with ORM semantics?
I'm trying to create a simple representation for an entity deduplication schema using MySQL, and using SQLAlchemy for programmatic access.
I'm trying to achieve a specific effect which I think is kind of a self-referential query but I'm not…

Jim
0
votes
5 answers
What's the best way to remove duplicates from a string in PHP (or any language)?
I am looking for the best known algorithm for removing duplicates from a string. I can think of numerous ways of doing this, but I am looking for a solution that is known for being particularly efficient.
Let's say you have the following…

chaimp
0
votes
2 answers
De-dupe NSArray of NSDictionaries based on specific keys
I am attempting to de-dupe an NSArray of NSDictionaries based on specific keys in the dictionaries. What I have looks something like this:
NSDictionary *person1 = [NSDictionary dictionaryWithObjectsAndKeys:@"John", @"firstName", @"Smith",…

mag725
0
votes
1 answer
postgresql: Finding the ids of rows that contain case-insensitive string duplication
I want to select and then delete a list of entries in my tables that have case-insensitive duplications.
In other words, there are these rows that are unique... ..but they're not unique…
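As a rough illustration of the grouping involved, the sketch below finds ids whose strings collide once case is ignored. It is plain Python over hypothetical `(id, text)` rows, not a PostgreSQL query, and the function name and sample data are assumptions made for the example.

```python
from collections import defaultdict


def case_insensitive_duplicates(rows):
    """Map each case-folded string to the ids sharing it, for groups larger than one.

    rows is an iterable of (row_id, text) pairs; these are hypothetical,
    since the question's table layout is not shown.
    """
    groups = defaultdict(list)
    for row_id, text in rows:
        groups[text.casefold()].append(row_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


rows = [(1, "Ninja"), (2, "ninja"), (3, "Samurai"), (4, "NINJA")]
print(case_insensitive_duplicates(rows))  # {'ninja': [1, 2, 4]}
```

Which id in each group to keep and which to delete is left to the caller, because the excerpt does not say.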

Kzqai
0
votes
0 answers
deduplication of ids in SELECT vs SELECT GROUP BY
We have an issue around deduplication when our data is spread across multiple indexes, and there exists a particular id in more than one index.
When doing a straight select, we get X records back, but when we do a group by, we will get counts that…

Adam Morgan
0
votes
1 answer
Removing duplicate records from JOIN in MS Access
My co-worker asked me for help with a query in MS Access that joins three tables. I have confirmed that the order and inner/outer status of the JOIN is what my co-worker wants. (They have three tables, A, B, and C; they want all records from table…

Codes with Hammer
0
votes
1 answer
SimHash deduplication output in MapReduce
I am implementing the SimHash algorithm [1] to deduplicate a dataset using MapReduce.
For example, if I have 4 documents Doc1, Doc2, Doc3, Doc4. Suppose that Doc1 is similar to Doc3 with a Hamming distance less than 3. Then after doing deduplication…
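The similarity test mentioned here, a Hamming distance below 3 between fingerprints, could be sketched as below. The fingerprint values are made up for illustration, the SimHash computation itself is assumed to have happened already, and the pairwise loop is a single-machine stand-in rather than a MapReduce job.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(fingerprints, threshold=3):
    """Return name pairs whose fingerprints differ in fewer than `threshold` bits.

    fingerprints maps a document name to an integer fingerprint.
    This compares every pair, so it is quadratic in the number of documents.
    """
    names = sorted(fingerprints)
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if hamming_distance(fingerprints[left], fingerprints[right]) < threshold:
                pairs.append((left, right))
    return pairs


# Hypothetical 8-bit fingerprints; real SimHash fingerprints are usually 64-bit.
fingerprints = {
    "Doc1": 0b10110010,
    "Doc2": 0b01001101,
    "Doc3": 0b10110011,
    "Doc4": 0b00000000,
}
print(near_duplicate_pairs(fingerprints))  # [('Doc1', 'Doc3')]
```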

Daisy
0
votes
0 answers
Dedupe rows and add up corresponding integers PHP/MySQL
I have a large MySQL table that includes events as rows. Each row has a description and a corresponding value. However, the titles are duplicated throughout, and I would like to deduplicate them and sum their values in PHP.
For example, if I…
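A minimal sketch of the dedupe-and-sum step, written in Python rather than PHP and using made-up rows, since the question's own example is cut off. The function name `sum_by_description` is hypothetical.

```python
from collections import defaultdict


def sum_by_description(rows):
    """Collapse rows that share a description, adding their values together.

    rows is an iterable of (description, value) pairs.
    """
    totals = defaultdict(int)
    for description, value in rows:
        totals[description] += value
    return dict(totals)


# Hypothetical sample data, not taken from the question.
rows = [("launch", 3), ("meeting", 2), ("launch", 5)]
print(sum_by_description(rows))  # {'launch': 8, 'meeting': 2}
```

If the aggregation can be pushed into the query instead, a `GROUP BY` on the description column with `SUM` over the value column does the same work inside MySQL.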

danmtslatter
0
votes
1 answer
Apache Solr 5 - deduplicating data within a field
Here is my question (pardon the wordiness):
I have millions of documents and all of them are unique.
However, all documents contain a 'description' field and this field contains data that only has a few different variations in the text across all…

Jeremy
0
votes
0 answers
Record Matching-performance improvement
I am doing record matching to find possible duplicate records. A record is said to be a duplicate of another record if (firstname and lastname) and (phone or email) match. Name fields are compared either exactly or fuzzily (distance, phonetic), and phone and…
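The matching rule stated in this question can be written as a small predicate. The sketch below covers only the exact-comparison case after trimming and lower-casing; the fuzzy (distance, phonetic) comparison is not shown, and the `Record` class and helper names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    firstname: str
    lastname: str
    phone: str = ""
    email: str = ""


def _norm(value: str) -> str:
    """Normalise a field for exact comparison."""
    return value.strip().lower()


def _same_nonempty(left: str, right: str) -> bool:
    """True when both fields are non-empty and equal after normalisation."""
    left, right = _norm(left), _norm(right)
    return left != "" and left == right


def is_duplicate(a: Record, b: Record) -> bool:
    """(firstname and lastname) and (phone or email), using exact matching only."""
    names_match = (
        _norm(a.firstname) == _norm(b.firstname)
        and _norm(a.lastname) == _norm(b.lastname)
    )
    contact_match = _same_nonempty(a.phone, b.phone) or _same_nonempty(a.email, b.email)
    return names_match and contact_match


a = Record("John", "Smith", phone="555-0100")
b = Record("john", "SMITH", phone="555-0100", email="js@example.com")
c = Record("John", "Smith", email="other@example.com")
print(is_duplicate(a, b), is_duplicate(a, c))  # True False
```

Treating an empty phone or email as a non-match keeps two records with missing contact details from being merged on name alone.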

VirtualLogic
0
votes
2 answers
Remove duplicates SQL while ignoring key and selecting max of specified column
I have the following sample data:
| key_id | name | name_id | data_id |
+--------+-------+---------+---------+
| 1 | jim | 23 | 098 |
| 2 | joe | 24 | 098 |
| 3 | john | 25 | 098 |
| 4 | jack | …

JDE876
0
votes
1 answer
How do we account for tipped transactions with Yodlee which are duplicated, with differing amounts?
We have a Yodlee integration and are experiencing an issue with double transactions posting. Here are the conditions of the scenario:
1) It is a tipped situation where the credit card is run at an amount and then later on a tip is added so the card…

Jay