De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
5
votes
4 answers
finding items to de-duplicate
I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.
The result I need is, for example:
X1 equals X3 and X6
X2 is unique
X4 equals X5
(Order…

peterchen
- 40,917
- 20
- 104
- 186
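
One common way to approach the grouping question above is to compute a cheap digest for each item in a single streaming pass and to run the expensive comparison only on items whose digests match. The sketch below is not taken from the question or its answers. It assumes each item can be read as bytes and that "equal" means byte-for-byte identical; `group_by_digest` and the sample payloads are hypothetical.

```python
import hashlib
from collections import defaultdict

def group_by_digest(items):
    """Group item names by the SHA-256 digest of their payload.

    `items` is an iterable of (name, payload_bytes) pairs, so only one
    payload has to be held in memory at a time. Assumption: equality
    means identical bytes.
    """
    buckets = defaultdict(list)
    for name, payload in items:
        digest = hashlib.sha256(payload).hexdigest()
        buckets[digest].append(name)
    # Each bucket holds candidates for one group of equal values.
    return list(buckets.values())

# Hypothetical sample data mirroring the X1..X6 example in the question.
groups = group_by_digest([
    ("X1", b"a"), ("X2", b"b"), ("X3", b"a"),
    ("X4", b"c"), ("X5", b"c"), ("X6", b"a"),
])
print(groups)
```

If a digest collision has to be ruled out, the expensive comparison can still be run, but only within each bucket rather than across the whole pool.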
5
votes
1 answer
How do I check for duplicate data on ElasticSearch?
When storing some documents, it should store the ones that do not exist yet and ignore the rest (should this be done at application level, maybe checking if document's id already exists, etc.?)

Matías Insaurralde
- 1,202
- 10
- 23
4
votes
3 answers
How to exclude duplicate records from a large data feed?
I have started working with a large dataset that is arriving in JSON format. Unfortunately, the service providing the data feed delivers a non-trivial number of duplicate records. On the up-side, each record has a unique Id number stored as a 64…

gras
- 55
- 1
- 6
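
When every record carries a unique Id, as the question above describes, a streaming filter that remembers the Ids already seen is often enough. This is a minimal sketch, not the accepted approach from the thread. It assumes one JSON object per line and a field named `Id`; both the field name and the `unique_records` helper are hypothetical.

```python
import json

def unique_records(lines, key="Id"):
    """Yield each parsed record the first time its key value appears.

    Assumptions: `lines` yields one JSON object per line, and the set of
    distinct Ids fits in memory even if the full feed does not.
    """
    seen = set()
    for line in lines:
        record = json.loads(line)
        if record[key] in seen:
            continue  # duplicate Id: skip the record
        seen.add(record[key])
        yield record

feed = ['{"Id": 1, "v": "a"}', '{"Id": 2, "v": "b"}', '{"Id": 1, "v": "a"}']
deduped = list(unique_records(feed))
print(deduped)
```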
4
votes
5 answers
How to de-dupe a List of Objects?
A Rec object has a member variable called tag which is a String.
If I have a List of Recs, how could I de-dupe the list based on the tag member variable?
I just need to make sure that the List contains only one Rec with each tag value.
Something…

Daniel K.
- 105
- 2
- 5
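
The de-duping by `tag` described above can be sketched with a dictionary keyed on the tag. The question is about Java; the example here is in Python only for illustration, and the `Rec` dataclass, its `payload` field, and `dedupe_by_tag` are hypothetical stand-ins. It assumes that keeping the first Rec seen for each tag is acceptable.

```python
from dataclasses import dataclass

@dataclass
class Rec:
    tag: str
    payload: str = ""  # hypothetical extra field, not from the question

def dedupe_by_tag(recs):
    """Keep the first Rec for each tag, preserving the original order."""
    first_by_tag = {}
    for rec in recs:
        # setdefault only stores the record if the tag is not present yet.
        first_by_tag.setdefault(rec.tag, rec)
    return list(first_by_tag.values())

recs = [Rec("a", "1"), Rec("b", "2"), Rec("a", "3")]
unique = dedupe_by_tag(recs)
print(unique)
```

The same idea maps onto a Java `LinkedHashMap` keyed by tag, which also preserves insertion order.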
4
votes
1 answer
HTTP Spec: PUT without data transfer, since hash of data is known to server
Does the HTTP/WebDav spec allow this client-server dialog?
client: I want to PUT data to /user1/foo.mkv which has this hash sum: HASH
server: OK, PUT was successful, you don't need to send the data since I already know the data with this hash…

guettli
- 25,042
- 81
- 346
- 663
4
votes
3 answers
C++ Remove duplication in a set of list
I'm trying to remove duplicates from the returned list in this problem:
Given a collection of candidate numbers (C) and a target number (T), find all unique combinations in C where the candidate numbers sum to T.
Each number in C may only be used…

1736964698
- 310
- 1
- 5
4
votes
2 answers
Python 2.7: Dedup list by adding suffix
I'm not sure I'm thinking about this problem correctly. I'd like to write a function which takes a list with duplicates and appends an iterating suffix to "dedup" the list.
For example:
dup_list =…

JMcClure
- 701
- 1
- 8
- 16
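
A suffix-based "dedup" like the one described above can be sketched with a running count per value. The question's own example is cut off, so the suffix format is an assumption: here the first occurrence is left unchanged and later ones get `_1`, `_2`, and so on. `dedup_with_suffix` is a hypothetical name.

```python
from collections import Counter

def dedup_with_suffix(values):
    """Append an incrementing suffix to repeated values.

    Assumed format: first occurrence unchanged, then value_1, value_2, ...
    Works on Python 2.7 and 3.
    """
    seen = Counter()
    result = []
    for value in values:
        seen[value] += 1
        if seen[value] == 1:
            result.append(value)
        else:
            result.append("%s_%d" % (value, seen[value] - 1))
    return result

print(dedup_with_suffix(["a", "b", "a", "a"]))
```

Note that this sketch does not guard against a generated name such as `a_1` colliding with a value that is already in the input.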
3
votes
2 answers
How do I enumerate and deduplicate 9 items allocated in triplets to each of 3 inheritors... and beyond?
This question is related to the context described in Seeking a solution or a heuristic approximation for the 3-partition combinatorial situation. The task is to distribute approximately 48 pieces of inherited jewelry, each with its appraised value, to…

GrabsAtStrawberries
- 51
- 3
3
votes
4 answers
Delete rows without leading zeros
I have a table with a column (registration_no varchar(9)). Here is a sample:
id registration no
1 42400065
2 483877668
3 019000702
4 837478848
5 464657588
6 19000702
7 042400065
Please take note of registration numbers like …

faithy
- 33
- 3
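
The question above is truncated, so its exact rule is not known. One plausible reading is that a row should be deleted when its registration number lacks a leading zero and a zero-padded version of the same number also exists. The sketch below illustrates only that assumed rule, in Python rather than SQL, using the sample rows from the question; `drop_unpadded_duplicates` is a hypothetical helper.

```python
rows = [
    (1, "42400065"), (2, "483877668"), (3, "019000702"), (4, "837478848"),
    (5, "464657588"), (6, "19000702"), (7, "042400065"),
]

def drop_unpadded_duplicates(rows):
    """Drop rows whose number also exists with leading zeros.

    Assumed rule (not confirmed by the question): the zero-padded form
    is the one to keep.
    """
    # Numbers that appear in a zero-padded form, with the zeros stripped.
    padded = {reg.lstrip("0") for _, reg in rows if reg.startswith("0")}
    return [
        (row_id, reg)
        for row_id, reg in rows
        if reg.startswith("0") or reg not in padded
    ]

kept = drop_unpadded_duplicates(rows)
print([row_id for row_id, _ in kept])
```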
3
votes
1 answer
Tool for helping with deduplication of Perl code?
I'm looking for some tool/library that would scan a given project tree and report on code duplicates, i.e. blocks of code that are repeated in various files.
Is there anything like this?
As it is now, I have to view them (files) all, and search for…
user80168
3
votes
2 answers
How do I keep a count of deduplicated messages from Logstash in ElasticSearch?
I see from this question that document_id can easily be used in Logstash to replace a duplicate record in ElasticSearch. How would I add/increment a count value for e.g. repeating syslog messages? Instead of just replacing the record I want to…

cfiske
- 184
- 6
3
votes
1 answer
Inserting millions of records with deduplication SQL
This is a theoretical scenario, and I am more than amateur when it comes to large scale SQL databases...
How would I go about inserting around 2 million records into an existing database of 6 million records (table1 into table2), whilst at the same…

kirgy
- 1,567
- 6
- 23
- 39
3
votes
3 answers
How can I remove duplicates from (deduplicate) an mbox format email mailbox?
I've got an mbox mailbox containing duplicate copies of messages, which differ only in their "X-Evolution:" header.
I want to remove the duplicate ones, in as quick and simple a way as possible. It seems like this would have been written already, but…

JesseW
- 1,255
- 11
- 19
3
votes
4 answers
Data structure choices for high-speed and memory-efficient detection of duplicate strings
I have an interesting problem that could be solved in a number of ways:
I have a function that takes in a string.
If this function has never seen this string before, it needs to perform some processing.
If the function has seen the string before,…

Jonathan Holland
- 1,243
- 11
- 17
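
For the "have I seen this string before" problem above, one simple option is to store a fixed-size digest of each string in a set instead of the string itself, which bounds the memory used per entry. This is a sketch under stated assumptions, not a recommendation from the thread. It assumes Python 3.6+ for `hashlib.blake2b`, and that a 16-byte digest makes collisions negligible for the workload; `SeenStrings` is a hypothetical name.

```python
import hashlib

class SeenStrings:
    """Track which strings have been seen, storing 16-byte digests only."""

    def __init__(self):
        self._digests = set()

    def first_time(self, text):
        """Return True the first time `text` is seen, False afterwards."""
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

seen = SeenStrings()
results = [seen.first_time(s) for s in ["alpha", "beta", "alpha"]]
print(results)
```

A Bloom filter would use less memory still, at the cost of occasional false "seen before" answers.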
2
votes
6 answers
Deduplicating HashMap Values
I'm wondering if anyone knows a good way to remove duplicate Values in a LinkedHashMap? I have a LinkedHashMap with pairs of String and List. I'd like to remove duplicates across the ArrayLists. This is to improve some downstream…

Jeff
- 877
- 2
- 11
- 17