De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
0
votes
1 answer
Dedup multi-line records with Perl
I have multi-line records in a text file that I'd like to dedupe using Perl.
Records are delimited by "#end-of-record" string and look like this:
CAPTAIN GIBLET'S NEWT CORRAL
555 RANDOM ST
TARDIS, CT 99999
We regret to inform you that we must repossess…

Bubnoff
- 3,917
- 3
- 30
- 33
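The question above asks for Perl. As a language-neutral illustration only, here is a minimal Python sketch of the idea: split the text on the "#end-of-record" delimiter and keep the first occurrence of each record. The function name `dedupe_records` and the sample data are hypothetical, and the sketch assumes two records are duplicates only when their text matches exactly after trimming surrounding whitespace.

```python
DELIMITER = "#end-of-record"  # delimiter string quoted in the question


def dedupe_records(text: str) -> list[str]:
    """Return the unique records in first-seen order."""
    seen = set()
    unique = []
    for chunk in text.split(DELIMITER):
        record = chunk.strip()  # assumption: surrounding whitespace is not significant
        if record and record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical sample: the first and third records are identical.
sample = (
    "A\n1 MAIN ST\n#end-of-record\n"
    "B\n2 SIDE ST\n#end-of-record\n"
    "A\n1 MAIN ST\n#end-of-record\n"
)
unique_records = dedupe_records(sample)
```

Holding whole records in a set is fine for small files. For very large files, storing a hash of each record instead would reduce memory use.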
0
votes
1 answer
SQL query to map duplicated entries for data enrichment
I'm fairly new to PostgreSQL.
I'm planning on running a data set of products through Mechanical Turk to enrich the data with pricing information. The problem is that I have 80,000 records uploaded by users, many of which are in actuality…

Nick Lashinsky
- 119
- 1
- 2
- 10
0
votes
1 answer
Single Instance Storage layers
I have a data storage requirement which is an excellent candidate for single instance storage and deduplication.
Can anyone suggest any .NET-compatible libraries or systems that handle SIS and deduplication, either with SQL Server as an actual…
user32826
0
votes
2 answers
De-duplicating similar but not identical URLs with a SQL query
I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com…

zgall1
- 2,865
- 5
- 23
- 39
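The question above asks for a SQL query. Purely to illustrate the normalization step, here is a minimal Python sketch that treats two URLs as the same when they differ only by a leading "www." prefix. The helper names and the sample list are hypothetical. The sketch assumes the prefix is the only difference that matters and keeps whichever spelling appears first.

```python
def canonical_url(url: str) -> str:
    """Strip a leading 'www.' so both spellings map to the same key."""
    url = url.strip()
    if url.startswith("www."):
        url = url[len("www."):]
    return url


def dedupe_urls(urls):
    """Keep the first spelling seen for each canonical URL."""
    first_seen = {}
    for url in urls:
        first_seen.setdefault(canonical_url(url), url)
    return list(first_seen.values())


# Hypothetical stand-in for the Website column of WebsiteData.
websites = ["www.google.com", "google.com", "example.org"]
deduped = dedupe_urls(websites)
```

The same normalize-then-group idea carries over to SQL by grouping on an expression that strips the prefix, though the exact string functions depend on the database.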
0
votes
1 answer
Flask, gunicorn, redis - Getting a 500 on the 3rd route, but POST works in previous steps
I'm trying to set up a local copy of web-dedupe working with the default setup, but it simply will not work for me after the third step. I'm able to upload the CSV, but after the fields are selected and the submit button is hit, I get an error:
The…

bootlear
- 13
- 3
0
votes
2 answers
Fuzzy matching Informatica vs SQL
We are currently debating whether to implement pairwise matching functions in SQL to perform fuzzy matching on invoice reference numbers, or go down the route of using Informatica.
Informatica is a great solution (so I've heard); however, I'm not…

user3933946
- 41
- 3
0
votes
1 answer
Need Client ID match in self join, but only bring me back the most current transaction (which is the highest transaction number)
Can someone provide me with a query format? I know I need to join the table to itself, but to get the following I am totally lost.
Column A - "Client ID" is a hard number (no duplicates at all).
Column B - "Transactions" contains multiple…
0
votes
2 answers
MySQL: deduplicate records in a single query
I have the following table:
CREATE TABLE `relations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`relationcode` varchar(25) DEFAULT NULL,
`email_address` varchar(100) DEFAULT NULL,
`firstname` varchar(100) DEFAULT NULL,
`latname` varchar(100)…

ErikL
- 2,031
- 6
- 34
- 57
0
votes
0 answers
Dedupe SQL Server Table [stored procedure]
I'm looking to dedupe a table using a stored procedure. No single column is unique, so I'd have to combine 2 or more columns to get a unique identifier. The ID column is an identity int, but is generated by SQL automatically at the time data is…

d90
- 767
- 2
- 10
- 28
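The question above targets a SQL Server stored procedure. As a rough illustration of the composite-key idea it describes, here is a minimal Python sketch that builds a key from two or more columns and keeps one row per key. The column names `name` and `city`, the sample rows, and the rule of keeping the lowest `ID` are all assumptions and do not come from the question.

```python
def dedupe_by_columns(rows, key_columns):
    """Keep one row per combination of key_columns (the lowest ID wins)."""
    kept = {}
    for row in sorted(rows, key=lambda r: r["ID"]):
        key = tuple(row[col] for col in key_columns)  # composite key
        kept.setdefault(key, row)  # the first row seen for a key is kept
    return list(kept.values())


# Hypothetical rows: IDs 1 and 2 share the same (name, city) pair.
rows = [
    {"ID": 2, "name": "Ann", "city": "Oslo"},
    {"ID": 1, "name": "Ann", "city": "Oslo"},
    {"ID": 3, "name": "Ann", "city": "Rome"},
]
result = dedupe_by_columns(rows, ["name", "city"])
```

In SQL the equivalent pattern is usually a grouping or window over the same columns, keeping one `ID` per group.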
0
votes
0 answers
HBase without using HDFS
I am doing a little research project and I am thinking about using HBase for it. I have read in the quick start guide that HBase can be set up using the local file system. I was reading this guy's paper:…

Derek
- 11,715
- 32
- 127
- 228
0
votes
2 answers
Update row null fields with values from similar rows (same "key")
My question is kind of hard to explain in the title, so I'll show the data and goal.
There is a MySQL table with the following structure:
CREATE TABLE customerProjectData(
idCustomer INT NOT NULL,
idProject INT DEFAULT NULL,
comePersons SMALLINT…

Joe
- 2,551
- 6
- 38
- 60
0
votes
3 answers
How to get unique rows using SET class from an Arraylist of "Arraylist string objects" of Type Setters & Getters class
I need help with Java code: how can I get unique records from an ArrayList that holds ArrayLists of value objects (a class with setters and getters)?
I'm reading a table and putting all records in an ArrayList of ArrayLists.…

user2682165
- 63
- 1
- 2
- 7
0
votes
0 answers
SQL: Fastest Way to Dedupe to Canonical Ids
I have an interesting SQL task and thought I would ask the community if anyone knows a fast way to accomplish it. I have 2 slow solutions, but I'm wondering if I am missing something faster.
Here is the task:
Given a list of records in a table,…

David Williams
- 8,388
- 23
- 83
- 171
0
votes
1 answer
Hash-based data deduplication
I am working on a project where I will get the data from the user's input form (no file processing). To avoid duplication, I want to use either fixed-length (fixed-block) or variable-length (variable-block) deduplication.
Which one is the better…

plzdontkillme
- 1,497
- 3
- 20
- 38
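To make the fixed-block option in the question above concrete, here is a minimal Python sketch of hash-based deduplication with fixed-size blocks. Each block is hashed with SHA-256, stored once, and the input is represented as an ordered list of digests. The function names, the 4-byte block size, and the sample data are illustrative assumptions. It does not show variable-block (content-defined) chunking.

```python
import hashlib


def dedupe_blocks(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each distinct block once."""
    store = {}   # digest -> block bytes
    recipe = []  # ordered digests needed to rebuild the input
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original bytes from the block store."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical input: the block b"ABCD" occurs three times.
data = b"ABCDABCDWXYZABCD"
store, recipe = dedupe_blocks(data)
```

Fixed-size blocks are simple, but an insertion near the start of the data shifts every later block boundary. Variable-block schemes exist to limit that effect, at the cost of more complex chunking.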
0
votes
1 answer
Python Dedup/Merge List of Dicts
Say that I have a list of dicts:
list = [{'name':'john','age':'28','location':'hawaii','gender':'male'},
{'name':'john','age':'32','location':'colorado','gender':'male'},
…

MTP
- 387
- 1
- 3
- 8