Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
0
votes
1 answer

Dedupe multi-line records with Perl

I have multi-line records in a text file that I'd like to dedupe using Perl. Records are delimited by the string "#end-of-record" and look like this: CAPTAIN GIBLET'S NEWT CORRAL 555 RANDOM ST TARDIS, CT 99999 We regret to inform you that we must repossess…
Bubnoff
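For the multi-line record question above, one possible approach can be sketched as follows. This is a Python illustration rather than Perl, and it rests on assumptions not stated in the excerpt: that duplicates are exact textual matches once surrounding whitespace is trimmed, that the first occurrence should be kept, and that the whole file fits in memory. The function name `dedupe_records` is hypothetical.

```python
# Delimiter string taken from the question excerpt.
DELIMITER = "#end-of-record"


def dedupe_records(text, delimiter=DELIMITER):
    """Drop exact-duplicate records, keeping the first occurrence of each.

    Assumption: two records are duplicates when their text is identical
    after stripping leading and trailing whitespace.
    """
    seen = set()
    unique = []
    for chunk in text.split(delimiter):
        record = chunk.strip()
        # Skip empty chunks (e.g. after the final delimiter) and repeats.
        if not record or record in seen:
            continue
        seen.add(record)
        unique.append(record)
    # Re-emit each surviving record followed by its delimiter line.
    return "".join(f"{record}\n{delimiter}\n" for record in unique)


if __name__ == "__main__":
    sample = (
        "A\nB\n#end-of-record\n"
        "C\n#end-of-record\n"
        "A\nB\n#end-of-record\n"
    )
    print(dedupe_records(sample), end="")
```

Using a set for the membership check keeps the pass linear in the number of records; for very large files, storing a hash of each record instead of the full text would be a reasonable variation.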
0
votes
1 answer

SQL query to map duplicated entries for data enrichment

I'm fairly new to PostgreSQL. I'm planning on running a data set of products through Mechanical Turk to enrich the data with pricing information. The problem is that I have 80,000 records uploaded by users, many of which are in actuality…
Nick Lashinsky
0
votes
1 answer

Single Instance Storage layers

I have a data storage requirement which is an excellent candidate for single instance storage and deduplication. Can anyone suggest any .NET-compatible libraries or systems that handle SIS and deduplication, either with SQL Server as an actual…
user32826
0
votes
2 answers

De-duplicating similar but not identical URLs with a SQL query

I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com…
zgall1
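The URL question above asks for a SQL query; the matching rule it describes can nevertheless be illustrated with a short Python sketch. The assumptions here go beyond the excerpt: two URLs count as duplicates only when they are identical apart from a leading "www.", and the first row encountered is the one kept. The helper names `strip_www` and `dedupe_urls` are hypothetical.

```python
def strip_www(url):
    """Return the URL without a leading 'www.' prefix, if it has one."""
    prefix = "www."
    return url[len(prefix):] if url.startswith(prefix) else url


def dedupe_urls(urls):
    """Keep the first URL seen for each www-insensitive key.

    Assumption: the input order decides which variant survives.
    """
    seen = set()
    kept = []
    for url in urls:
        key = strip_www(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


if __name__ == "__main__":
    rows = ["www.google.com", "google.com", "example.org"]
    print(dedupe_urls(rows))
```

The same idea carries over to SQL by grouping on an expression that removes the prefix, though the exact string functions depend on the database in use.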
0
votes
1 answer

Flask, gunicorn, redis - Getting 500'd the 3rd route, but POST works in previous steps

I'm trying to set up a local copy of web-dedupe working with the default setup, but it simply will not work for me after the third step. I'm able to upload the CSV, but after the fields are selected and the submit button is hit, I get an error: The…
bootlear
0
votes
2 answers

Fuzzy matching Informatica vs SQL

We are currently debating whether to implement pairwise matching functions in SQL to perform fuzzy matching on invoice reference numbers, or go down the route of using Informatica. Informatica is a great solution (so I've heard); however, I'm not…
0
votes
1 answer

Need Client ID match in self join, but only bring me back the most current transaction (which is the highest transaction number)

Can someone provide me with a query format? I know I need to join the table to itself, but to get the following I am totally lost. Column A - "Client ID" is a hard number (no duplicates at all). Column B - "Transactions" contains multiple…
0
votes
2 answers

MySQL: deduplicate records in a single query

I have the following table: CREATE TABLE `relations` ( `id` int(11) NOT NULL AUTO_INCREMENT, `relationcode` varchar(25) DEFAULT NULL, `email_address` varchar(100) DEFAULT NULL, `firstname` varchar(100) DEFAULT NULL, `latname` varchar(100)…
ErikL
0
votes
0 answers

Dedupe SQL Server Table [stored procedure]

I'm looking to dedupe a table using a stored procedure. No single column is unique, so I'd have to combine 2 or more columns to get a unique identifier. The ID column is an identity int, but it is generated by SQL automatically at the time data is…
d90
0
votes
0 answers

HBase without using HDFS

I am doing a little research project and I am thinking about using HBase for it. I have read in the quick start guide that HBase can be set up using the local file system. I was reading this guy's paper:…
Derek
0
votes
2 answers

Update row null fields with values from similar rows (same "key")

My question is kind of hard to explain in the title, so I'll show the data and goal. There is a MySQL table with the following structure: CREATE TABLE customerProjectData( idCustomer INT NOT NULL, idProject INT DEFAULT NULL, comePersons SMALLINT…
Joe
0
votes
3 answers

How to get unique rows using SET class from an Arraylist of "Arraylist string objects" of Type Setters & Getters class

I need help with Java code: how can I get unique records from an ArrayList of ArrayLists whose elements are value objects (a class with setters and getters)? I'm reading a table and putting all records in an ArrayList of ArrayLists.…
user2682165
0
votes
0 answers

SQL: Fastest Way to Dedupe to Canonical Ids

I have an interesting SQL task and thought I would ask the community if anyone knows a fast way to accomplish it. I have 2 slow solutions, but I'm wondering if I am missing something faster. Here is the task: Given a list of records in a table,…
David Williams
0
votes
1 answer

Hash-based data deduplication

I am working on a project where I will get the data from the user's input form (no file processing). To avoid duplication, I want to use either the fixed-length (fixed-block) or the variable-length (variable-block) approach. Which one is the better…
plzdontkillme
0
votes
1 answer

Python Dedup/Merge List of Dicts

Say that I have a list of dicts: list = [{'name':'john','age':'28','location':'hawaii','gender':'male'}, {'name':'john','age':'32','location':'colorado','gender':'male'}, …
MTP