Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
1
vote
2 answers

Scrape unique image URLs from HTML

I'm using PHP to curl a web page (a URL entered by the user; let's assume it's valid). Example: http://www.youtube.com/watch?v=Hovbx6rvBaA I need to parse the HTML and extract all de-duplicated URLs that look like images, not just the ones in img…
ronik
1
vote
1 answer

Delete duplicates from two columns using the first one as a criterion

I have a problem with Excel 2007 and I really don't know what to do. I have a file with about 200k+ entries in two columns. In column A (time) some values are duplicated, and likewise in column B (amp). I would like to delete duplicates from both columns at once,…
1
vote
1 answer

Explicit sort parallelization via xargs -- Incomplete results from xargs --max-procs

Context: I need to optimize deduplication using 'sort -u', and my Linux machine has an old implementation of the 'sort' command (version 5.97) that lacks the '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to…
Manolo
1
vote
1 answer

Solr Deduplication not working

I'm using Solr 5.2.1 and I have a field "url" that needs to be unique. I have followed https://wiki.apache.org/solr/Deduplication, but I can still update the index with the same URL many times; Solr deduplication did not work to stop that from…
1
vote
1 answer

Label:- XMLContent De-duplication

Question 1: I am currently working on a project in which we translate English content into 17 other languages. To reduce the translation cost, we currently use an MD5 hash code, and based on the results we decide whether the topic is…
1
vote
1 answer

T-SQL Query Results Not as Expected Deduplication

I am attempting to get all records where an Id field exists more than once. The trouble is that my query returns nothing and I have no idea why. This is the only method I know. Some more information: there are up to 8 of the same Order…
Matt
1
vote
1 answer

Improving Run Time for deduping lists based on only certain columns in Python

I have two CSV files. I'm trying to remove all rows where certain columns match. I thought I'd use lists in Python to do this. I thought it'd be fast, but it's running way too slowly. I only want to compare the first 3 columns, as the last 2 are…
M E
1
vote
1 answer

getting multiple values out of an SQL statement

What I am trying to do is work out whether there are teachers with duplicate initials. I tried to do this by returning one value from the database file with the searched-for initials, then returning all the values with the searched-for initials. Then I…
Ben
1
vote
2 answers

Python - Selecting All Row Values That Meet A particular Criteria Once

I have a form set up with the following fields: Date Time, ID, and Address. This form auto assigns each entry a unique id string (U_ID) and then this data is later output to a csv with headers and rows something like this: Date Time ID U_ID…
roliv
1
vote
0 answers

Compute SHA1 of partial download

I'm scraping a gazillion audio files from government websites and I want to avoid getting duplicate files. With small files I've scraped in the past, I download the entire file, compute a SHA1 hash for it and compare that against the items already…
mlissner
1
vote
0 answers

solr filter effects seen in analyzer but not score

As part of my fieldType definition I have a filter, which can be found at https://github.com/gaillard/solr-filter-dedup, that deduplicates tokens in that field. When I use the Solr analyzer I see duplicate terms being removed for the indexer, so it…
fields
1
vote
2 answers

Data deduplication - Postfix server?

I have a mail server running Postfix. Each message is saved as a file in the filesystem, so I'm trying to figure out whether there is a way to reduce duplicated files and thereby reduce disk space usage. I tried to install and use opendedup, but I really did not…
jvolt
1
vote
5 answers

de-duplicate a list of strings

I very commonly run into this issue: I have a CSV file with a list of data in it, and I need to remove duplicates (or sometimes find the values that are duplicated). The CSV is easy to bring into Excel, but I can't find (or never remember) a good…
Brady Moritz
1
vote
0 answers

GDB: Lessfs; How to Trace

I am trying to trace an open source program called lessfs, an inline data deduplication filesystem for Linux, but I am having trouble stepping through it using GDB. Lessfs can be found here: http://www.lessfs.com/wordpress/ Are there any…
humblebeast
1
vote
1 answer

Remove duplicates from LUA Table by timestamp

I was on Stack a few days back for help inserting records in a way that prevents duplicates. However, the process to enter these is slow and they slip in. I have a user base of about 10,000 players, and they have duplicate entries. I've been trying to filter…