Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
2
votes
1 answer

Does Google AdWords de-dupe conversions?

I'm using AdWords to track conversions on an Ajax site. It works well for single conversions that have a unique label and value. The problem: on the site I have a use case where a user can fire multiple very similar-looking conversions in short…
Jesse
  • 10,370
  • 10
  • 62
  • 81
2
votes
3 answers

Deduplication and filtering of Add/Remove Programs list (VBScript)

This script works and tells me what is installed in Program Files. Two problems: duplicate lines, i.e. AVG 2011 Ver: 10.0.1204 AVG 2011 Ver: 10.0.1204 Installed: 27/01/2011, and I don't want to include lines that have keywords…
icecurtain
  • 675
  • 3
  • 9
  • 30
2
votes
0 answers

R generate json string from two key columns

New to R. I'm developing an entity resolution algorithm using the RecordLinkage package. I've had pretty good success so far - using dedup, I end up with a data frame, two columns of which are keys of matched records, as below: x <- list(key1 =…
2
votes
2 answers

Generating numpy array of indices for a deduplicated set of points

I have an array of a minimum of 10s of thousands of points (up to 3 billion) some of which are duplicated. I'd like to deduplicate the points and generate an index array which retains the original sequence of the duplicated points. For example: x =…
Brian Bruggeman
  • 5,008
  • 2
  • 36
  • 55
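A minimal sketch of one way to approach the question above, assuming the points are rows of a 2-D NumPy array. The function name `dedupe_points` and the sample data are hypothetical. Note that `np.unique` returns the unique rows in sorted order, not first-occurrence order, and it sorts in memory, so it may not scale to billions of points without chunking.

```python
import numpy as np


def dedupe_points(points: np.ndarray):
    """Return the unique rows of `points` and, for every original row,
    the index of its match in the unique set."""
    unique_points, inverse = np.unique(points, axis=0, return_inverse=True)
    # ravel() keeps the index array 1-D regardless of NumPy version.
    return unique_points, inverse.ravel()


# Hypothetical sample data: the first and third points are duplicates.
points = np.array([[1, 2], [3, 4], [1, 2], [0, 0]])
unique_points, index = dedupe_points(points)
```

Indexing the unique set with the index array, `unique_points[index]`, reproduces the original sequence, duplicates included.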
2
votes
1 answer

SBT Allow duplicates during assembly

Is there a way to turn off deduplication in SBT's assembly plugin? I've been cleaning out an sbt assembly build the old-fashioned way, using sbt dependency-graph to remove jar files which have differing versions of the same file. if…
jayunit100
  • 17,388
  • 22
  • 92
  • 167
2
votes
1 answer

Deduping Column pairs in R

I have a dataframe containing 7 columns and would like to dedupe records that have the same info in the first two columns, even if they are in reverse order. Here is a snippet of my df: zip1 zip2 APP PCR SCR APJ PJR 1 01701 01701…
ben890
  • 1,097
  • 5
  • 25
  • 56
2
votes
5 answers

Deduping SQL Server table

I have an issue. I have a table with almost 2 billion rows (yeah, I know...) that has a lot of duplicate data in it, which I'd like to delete. I was wondering how to do that exactly? The columns are: first, last, dob, address, city, state, zip,…
Sal
  • 295
  • 3
  • 5
  • 13
2
votes
3 answers

Can we write a generic array/slice deduplication in Go?

Is there a way to write a generic array/slice deduplication in Go? For []int we can have something like (from http://rosettacode.org/wiki/Remove_duplicate_elements#Go ): func uniq(list []int) []int { unique_set := make(map[int] bool,…
Ali
  • 18,665
  • 21
  • 103
  • 138
2
votes
1 answer

Deciding key-value pair for deduplication using Hadoop MapReduce

I want to implement deduplication of files using Hadoop MapReduce. I plan to do it by calculating the MD5 sum of all the files present in the input directory in my mapper function. These MD5 hashes would be the keys to the reducer, so files with the same…
ManTor
  • 23
  • 2
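A rough single-machine sketch of the keying idea in the question above, written in plain Python rather than Hadoop. The function names `md5_of_file`, `map_phase`, and `reduce_phase` are hypothetical, and the in-memory grouping only stands in for the shuffle step a real MapReduce job would perform. Matching MD5 sums only make files duplicate candidates; a byte-for-byte comparison would confirm them.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def map_phase(name: str, data: bytes) -> Tuple[str, str]:
    """Emit (key, value) with the MD5 hex digest as key and the file name as value."""
    return hashlib.md5(data).hexdigest(), name


def reduce_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group file names by digest; a group with more than one name holds duplicate candidates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for digest, name in pairs:
        groups[digest].append(name)
    return dict(groups)
```

With the digest as the key, every file sharing that digest arrives at the same reducer call, which can then keep one copy and record the rest as references.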
2
votes
2 answers

Data matching/deduplication in SQL Server 2008 R2

What are the options for making a data cleansing process (deduplication/matching) when dealing with MS SQL Server 2008 R2? Or better yet how can I weight scores on a matching process on columns of a row? The situation is the following: I have a…
2
votes
1 answer

How do all the existing directories get mounted at the mount point when using FUSE?

I'm trying to build a new filesystem with deduplication using FUSE. I tried running the fusexmp_fh.c provided in the example section of FUSE. However, after mounting the filesystem at a mount point, I can see all the existing directories inside…
gunner4evr
  • 79
  • 7
2
votes
2 answers

Duplicate Key Filtering

I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a…
scottw
  • 51
  • 1
  • 3
2
votes
6 answers

C Programming: how to avoid code duplication without losing clarity

edit: Thanks to all repliers. I should have mentioned in my original post that I am not allowed to change any of the specifications of these functions, so solutions using assertions and/or allowing to dereference NULL are out of the question. With…
Elad Shtiegmann
  • 211
  • 2
  • 7
2
votes
2 answers

List Comprehension Mystery - Python

I have created two CSV lists. One is an original CSV file, the other is a DeDuped version of that file. I have read each into a list and for all intents and purposes they are the same format. Each list item is a string. I am trying to use a list…
2
votes
0 answers

DeDuping millions of rows using LOAD DATA INFILE or other solution

Good day to all. I know this topic comes up a lot, and I apologize for any redundancy, but I need you MySQL gurus. I have tried several solutions that have been posted here, to no avail. The solutions either take too long and/or, more likely, I just don't…