Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
1
vote
1 answer

Mail deduplication multiple users

I'm currently deduplicating emails on a per-user (email account) basis. I'm creating a SHA-512 hash of several headers (message-id, subject, from, date, to). After that, I store the full email (MIME string) in a file and insert the metadata…
Floris
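A minimal Python sketch of the per-user header-hash approach described in this question. The header list comes from the excerpt. The NUL separator, the whitespace normalisation, and the names `email_fingerprint` and `is_duplicate` are assumptions made for illustration, not the asker's actual code.

```python
import hashlib
from email import message_from_string

# Headers named in the question; separator and normalisation are assumptions.
DEDUP_HEADERS = ("Message-ID", "Subject", "From", "Date", "To")


def email_fingerprint(mime_string: str) -> str:
    """Return a SHA-512 hex digest over selected headers of a MIME message."""
    msg = message_from_string(mime_string)
    parts = []
    for name in DEDUP_HEADERS:
        value = msg.get(name, "")
        # Collapse whitespace so folded headers hash identically.
        parts.append(" ".join(str(value).split()))
    # Join with NUL so adjacent header values cannot run into each other.
    joined = "\x00".join(parts)
    return hashlib.sha512(joined.encode("utf-8")).hexdigest()


def is_duplicate(seen: set, user_id: str, mime_string: str) -> bool:
    """Record (user, fingerprint) and report whether it was already present."""
    key = (user_id, email_fingerprint(mime_string))
    if key in seen:
        return True
    seen.add(key)
    return False
```

Keying on `(user_id, fingerprint)` keeps the check per user. Dropping `user_id` from the key would be one way to deduplicate across users, at the cost of sharing stored messages between accounts.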
1
vote
5 answers

Perl - While removing duplicates in one array, pop the element from another array

I have two arrays that are associated. The first has what would be a "key" in a hash, the second has the "value". There are multiple instances of each "key" in the array, and the value associated with each key can be either yes or no. A quick…
ohm
1
vote
1 answer

copy all unique files in a directory based on hashes

file=$3 #Using $3 as I am using 1 & 2 in the rest of the script[that works] file_hash=md5sum "$file" | cut -d ' ' -f l #generates hashes for file for a in /path/to/source/* #loop for all files in directory do if [ "$file_hash" == $(md5sum "$a"…
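For comparison, a hedged Python sketch of the idea in this question's title: copy each file from a source directory only if no file with the same content hash has been copied yet. It uses MD5 as in the excerpt. The function names and the choice to keep the first file in sorted order are assumptions, and this is not a translation of the asker's shell script.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_unique_files(source_dir: str, dest_dir: str) -> list:
    """Copy one file per distinct content hash; return the copied names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    seen_hashes = set()
    copied = []
    # Sorted order makes the "first file wins" choice deterministic.
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        file_hash = file_md5(path)
        if file_hash in seen_hashes:
            continue  # same content already copied under another name
        seen_hashes.add(file_hash)
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    return copied
```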
1
vote
3 answers

Deduplicate string instances

I have an array of nearly 1,000,000 records, each record has a field "filename". There are many records with exactly the same filename. My goal is to improve memory footprint by deduplicating string instances (filename instances, not records). .NET…
DxCK
1
vote
2 answers

Space efficient marketing email storage

I'm working on a mail gateway that would automatically provide (among other things) "view in browser" functionality for all emails that are being sent through it. This raises the need to store all emails somewhere so that they can be easily…
Sergey
1
vote
1 answer

R How to remove duplicates from a list of lists

I have a list of lists that contain the following 2 variables: > dist_sub[[1]]$zip [1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 [26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962…
1
vote
3 answers

Python remove duplicate cases that are in inverted matrix

I have a list that looks like this: relationShipArray = [] relationShipArray.append([340859419124453377, 340853571828469762]) relationShipArray.append([340859419124453377, 340854579195432961]) relationShipArray.append([340770796777660416,…
user2091936
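A small sketch of one way to handle this, assuming "inverted" means the same two IDs appearing in the opposite order (`[a, b]` and `[b, a]`). Each pair is reduced to an order-independent key and only the first occurrence is kept. The function name `drop_inverted_duplicates` is hypothetical.

```python
def drop_inverted_duplicates(pairs):
    """Keep the first occurrence of each pair, ignoring element order."""
    seen = set()
    unique = []
    for a, b in pairs:
        # Order-independent key: smaller ID first.
        key = (a, b) if a <= b else (b, a)
        if key not in seen:
            seen.add(key)
            unique.append([a, b])
    return unique
```

The result preserves the original order of the list and the original orientation of each surviving pair.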
1
vote
1 answer

Iterate over a list that contain duplicate elements

I'm trying to iterate over a list that contains some duplicate elements. I'm using the number of duplicates, so I don't want to put the list in a set before I iterate over the list. I'm trying to count how many times the element appears and then write the…
novafluxx
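A minimal sketch of counting occurrences without discarding the duplicates first, assuming the elements are hashable. `collections.Counter` does the counting, and `dict.fromkeys` yields each distinct element once in first-seen order. The function name `count_occurrences` is made up for illustration.

```python
from collections import Counter


def count_occurrences(items):
    """Return (element, count) pairs in order of first appearance."""
    counts = Counter(items)
    # dict.fromkeys keeps one entry per distinct element, in insertion order.
    return [(item, counts[item]) for item in dict.fromkeys(items)]
```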
1
vote
1 answer

Curated content vs feed content deduplication

On the homepage of the public website we have multiple modules divided between curated content (users manually select articles/publications) vs feed content (automatically populated module based on parameters and usually sorted by date). These…
Gabbar
1
vote
2 answers

Finding Similarity Between Addresses

I have written the following piece of code to find the similarity between two postal addresses: double similarAddr(String resAddr, String newAddr) { String sortedResAddr = asort(resAddr); // asort alphabetically sorts the sentence…
Joy
1
vote
2 answers

Aggregating and deduplicating information extracted from multiple web sites

I am building a database of timing and address information for restaurants, extracted from multiple web sites. Information for the same restaurant may be present on multiple web sites, so in the database I will have some nearly…
Joy
1
vote
2 answers

Duplication Algorithm in Java

I am looking for a duplicate-matching algorithm in Java. My scenario is that I have two tables. Table 1 contains 25,000 string records in one column, and Table 2 similarly contains 20,000 string records. I want to check for duplicate records in both…
asher baig
1
vote
1 answer

Powershell: Deduping an array

I have a pipe-delimited flat file whose entries I need to deduplicate based on an object. To be specific, a part of the file is:…
DhawalV
1
vote
2 answers

Deduplicating .pst files to find unique emails

I have (what seems like) a large task at hand. I need to go through different archive volumes of multiple folders (we're talking terabytes of data). Within each folder is a .pst file. Some of these folders (and therefore files) may be exactly…
User_1403834
1
vote
2 answers

What are the best practices to create a Solr-based de-duplication system?

I am setting up a Solr search-based de-duplication system that would return search results matching the search criteria. I have used the data import handler to pull data from the database and create indexed documents on the Solr server. My Solr schema is as…
Tushu