De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
2
votes
1 answer
Does Google AdWords de-dupe conversions?
I'm using AdWords to track conversions on an ajax site. It works well for single conversions that have a unique label and value.
The Problem:
On the site I have a use case where a user can fire multiple very similar looking conversions in short…

Jesse
2
votes
3 answers
Deduplication and filtering of Add/Remove Programs list (VBScript)
This script works and tells me what is installed in Program Files.
Two problems
Duplicate lines
i.e
AVG 2011 Ver: 10.0.1204
AVG 2011 Ver: 10.0.1204 Installed: 27/01/2011
and
I don't want to include lines that have key words…

icecurtain
2
votes
0 answers
R generate json string from two key columns
New to R.
I'm developing an entity resolution algorithm using the RecordLinkage package. I've had pretty good success so far - using dedup, I end up with a data frame, two columns of which are keys of matched records, as below:
x <- list(key1 =…

Aaron McLendon
2
votes
2 answers
Generating numpy array of indices for a deduplicated set of points
I have an array of at least tens of thousands of points (up to 3 billion), some of which are duplicated. I'd like to deduplicate the points and generate an index array which retains the original sequence of the duplicated points.
For example:
x =…
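
One possible reading of this question, sketched in plain Python: keep the first occurrence of every point and build an index array that maps each original position to its slot in the deduplicated list. The function name `dedupe_with_index` and the sample data are assumptions for illustration, not taken from the question.

```python
def dedupe_with_index(points):
    """Return (unique_points, index) where index[i] is the position
    of points[i] inside unique_points (first-seen order)."""
    first_seen = {}   # point -> position in unique
    unique = []       # deduplicated points, in first-seen order
    index = []        # one entry per original point
    for p in points:
        key = tuple(p)            # tuples are hashable, lists are not
        pos = first_seen.get(key)
        if pos is None:
            pos = len(unique)
            first_seen[key] = pos
            unique.append(key)
        index.append(pos)
    return unique, index


unique, index = dedupe_with_index([(1, 2), (3, 4), (1, 2), (0, 0)])
# The original sequence can be rebuilt as [unique[i] for i in index].
```

For arrays at the scale mentioned in the question, a pure-Python loop is likely too slow. `numpy.unique(points, axis=0, return_inverse=True)` returns a comparable pair, though it sorts the unique rows rather than keeping first-seen order.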

Brian Bruggeman
2
votes
1 answer
SBT Allow duplicates during assembly
Is there a way to turn off deduplication in SBT's assembly plugin?
I've been cleaning out an sbt assembly build the old-fashioned way, using sbt dependency-graph to remove jar files which have differing versions of the same file.
if…

jayunit100
2
votes
1 answer
Deduping Column pairs in R
I have a dataframe containing 7 columns and would like to remove records that have the same info in the first two columns, even if they are in reverse order.
Here is a snippet of my df
zip1 zip2 APP PCR SCR APJ PJR
1 01701 01701…
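
The idea in this question (treating the pair in the first two columns as unordered) can be sketched as follows. The question is about R; this is a Python illustration only, and the function name `dedupe_unordered_pairs` and the sample rows are assumptions.

```python
def dedupe_unordered_pairs(rows):
    """Keep the first row for each unordered pair found in the
    first two columns, so (a, b) and (b, a) count as duplicates."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row[:2]))  # order-independent key
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    ("01701", "01702", 1),
    ("01702", "01701", 2),  # same pair as the first row, reversed
    ("01701", "01701", 3),
]
kept = dedupe_unordered_pairs(rows)
```

Sorting the two key values before comparing is what makes the reversed pair collide with the original one.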

ben890
2
votes
5 answers
Deduping SQL Server table
I have an issue. I have a table with almost 2 billion rows (yeah I know...) and it has a lot of duplicate data that I'd like to delete. I was wondering how to do that exactly?
The columns are: first, last, dob, address, city, state, zip,…

Sal
2
votes
3 answers
Can we write a generic array/slice deduplication in Go?
Is there a way to write a generic array/slice deduplication in Go? For []int we can have something like (from http://rosettacode.org/wiki/Remove_duplicate_elements#Go ):
func uniq(list []int) []int {
unique_set := make(map[int] bool,…

Ali
2
votes
1 answer
Deciding key-value pair for deduplication using Hadoop MapReduce
I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of all the files present in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same…
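
The plan described here (digest as the key, file as the value) can be imitated outside Hadoop to see how the grouping would behave. This is a minimal in-memory sketch, not Hadoop code: `map_file`, `reduce_by_digest`, and the sample file contents are hypothetical names and data.

```python
import hashlib
from collections import defaultdict


def map_file(path, data):
    """Mapper step: emit (md5 hex digest, path) for one file's bytes."""
    return hashlib.md5(data).hexdigest(), path


def reduce_by_digest(pairs):
    """Reducer step: collect all paths that share a digest."""
    groups = defaultdict(list)
    for digest, path in pairs:
        groups[digest].append(path)
    return dict(groups)


files = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
groups = reduce_by_digest(map_file(p, d) for p, d in files.items())
# Any digest with more than one path marks a set of duplicate files.
duplicates = {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Equal MD5 digests strongly suggest identical content but do not prove it, so a byte-for-byte comparison within each group may still be worthwhile before deleting anything.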

ManTor
2
votes
2 answers
Data matching/deduplication in SQL Server 2008 R2
What are the options for building a data cleansing process (deduplication/matching) when dealing with MS SQL Server 2008 R2?
Or better yet, how can I weight scores in a matching process on the columns of a row?
The situation is the following: I have a…

MariaMadalina
2
votes
1 answer
How do all the existing directories get mounted at the mount point when using FUSE?
I'm trying to build a new filesystem with deduplication using FUSE.
I tried running the fusexmp_fh.c provided in the example section of the FUSE. However after mounting the filesystem at a mount point, I can see all the existing directories inside…

gunner4evr
2
votes
2 answers
Duplicate Key Filtering
I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a…

scottw
2
votes
6 answers
C Programming: how to avoid code duplication without losing clarity
edit: Thanks to all repliers. I should have mentioned in my original post that I am not allowed to change any of the specifications of these functions, so solutions using assertions and/or allowing to dereference NULL are out of the question.
With…

Elad Shtiegmann
2
votes
2 answers
List Comprehension Mystery - Python
I have created two CSV lists. One is an original CSV file, the other is a DeDuped version of that file. I have read each into a list and for all intents and purposes they are the same format. Each list item is a string.
I am trying to use a list…

MaxSavageKramer
2
votes
0 answers
DeDuping millions of rows using LOAD DATA INFILE or other solution
Good day to all. I know this topic comes up a lot, and I apologize for any redundancy, but I need you MySQL gurus.
I have tried several solutions that have been posted here to no avail. The solutions either take too long and/or more likely I just don't…

user2689658