1

I am looking for a duplicate-matching algorithm in Java. My scenario is as follows:

I have two tables. Table 1 contains 25,000 string records in a single column, and similarly Table 2 contains 20,000 string records. I want to check for duplicate records between Table 1 and Table 2. The records look like this, for example:

Table 1

Jhon,voltra

Bruce willis

Table 2

voltra jhon

bruce, willis

I am looking for an algorithm that can find this type of duplicate string match between these two tables, stored in two different files. Can someone suggest two or more algorithms that can perform such queries in Java?

asher baig
  • 39
  • 3
  • 2
    Sounds like specific logic to me. As such, it is up to you to implement its behavior, which in this case is to determine what is considered a duplicate and what isn't. --- In other words, *"matching duplicates"* has ready algorithms. *"Matching duplicates this specific way"* doesn't. – CosmicGiant Nov 26 '12 at 15:02
  • Are the only string formats used in those files "firstname lastname" and "lastname, firstname"? Are there others? Is there a limited number of formats, or should spelling mistakes and the like be considered as duplicates as well? – Daniel S. Nov 26 '12 at 15:05
  • Can you name those "matching duplicates" algorithms? The format does seem to be firstname lastname and lastname, firstname, but each table contains only one column. – asher baig Nov 26 '12 at 15:16

2 Answers

5

Read the two files into a normalised form so the records can be compared. Put the entries from each file into a Set and use retainAll() to find the intersection of the two sets. The intersection is the duplicates.
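A minimal sketch of this approach. The normalisation rule here (lowercase, treat commas as spaces, sort the name tokens so "Jhon,voltra" and "voltra jhon" compare equal) is an assumption based on the sample data, not something the question specifies:

```java
import java.util.*;
import java.util.stream.*;

public class DuplicateFinder {
    // Normalise a record: lowercase, replace commas with spaces,
    // split on whitespace, sort the tokens, and rejoin them.
    // This makes "Jhon,voltra" and "voltra jhon" identical.
    static String normalise(String line) {
        String[] tokens = line.toLowerCase().replace(',', ' ').trim().split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    // Normalise both tables into Sets and intersect them with retainAll().
    static Set<String> duplicates(List<String> table1, List<String> table2) {
        Set<String> set1 = table1.stream()
                .map(DuplicateFinder::normalise)
                .collect(Collectors.toCollection(HashSet::new));
        Set<String> set2 = table2.stream()
                .map(DuplicateFinder::normalise)
                .collect(Collectors.toSet());
        set1.retainAll(set2); // set1 now holds only records present in both tables
        return set1;
    }
}
```

With the sample records, `duplicates(List.of("Jhon,voltra", "Bruce willis"), List.of("voltra jhon", "bruce, willis"))` finds both pairs. Each record is hashed once, so the whole pass is O(N) in the number of records.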

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • Can you provide names of such algorithms which can be customized according to requirement? – asher baig Nov 26 '12 at 15:17
  • I have added links which relate to the "names" or terms used. – Peter Lawrey Nov 26 '12 at 15:25
  • thanks for updating links, actually I have understood those techniques. My problem is I need to analyze some already existing algorithm. Looking for a duplicate-matching algorithm – asher baig Nov 26 '12 at 15:33
  • "i need to analyze some already existing algorithm" is this an algorithm you have already or you are looking for. What sort of analysis on the algo do you want to perform? – Peter Lawrey Nov 26 '12 at 16:10
  • No, I don't know any algorithm yet. I am looking for a good algorithm for which I can try to answer these questions: how much time will it take to get an exact result? How much time does one algorithm take to find duplicate records in a particular time? What is the complexity of the algorithm? How effective are these algorithms for a huge amount of records? What is the ratio of the find-duplicate algorithm? – asher baig Nov 26 '12 at 16:25
  • The algo given is a good one. ;) The time complexity is O(N). You will spend more of your time reading the files rather than finding duplicates, and even that I expect to be quick. I would expect it to take well under a second to read, normalise the data, find duplicates and write the result to a file. One million records is not huge so I wouldn't worry about it until you have much, much more than this. I don't understand what you mean by "ratio of find duplicate algorithm" – Peter Lawrey Nov 26 '12 at 16:48

0

You can use a Map<String, Integer> (e.g. a HashMap), read the files line by line, and insert the strings into the map, incrementing the value each time you find an existing entry.

You can then search through your map and find all entries with a count > 1.
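A sketch of this counting approach. As with the other answer, the strings need to be normalised before insertion so that "Jhon,voltra" and "voltra jhon" count as the same key; the normalisation rule below is an assumption inferred from the sample data:

```java
import java.util.*;

public class DuplicateCounter {
    // Same assumed normalisation: lowercase, commas to spaces, sorted tokens.
    static String normalise(String line) {
        String[] tokens = line.toLowerCase().replace(',', ' ').trim().split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    // Count occurrences of each normalised record across both tables,
    // incrementing the value each time an existing entry is found.
    static Map<String, Integer> countRecords(List<String> table1, List<String> table2) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> table : List.of(table1, table2))
            for (String line : table)
                counts.merge(normalise(line), 1, Integer::sum);
        return counts;
    }

    // Entries with a count > 1 appeared more than once, i.e. are duplicates.
    static List<String> duplicates(Map<String, Integer> counts) {
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > 1)
                dups.add(e.getKey());
        return dups;
    }
}
```

Note that a count > 1 also flags a record that appears twice within the same table; if you only want cross-table matches, keep one map per table and compare them instead.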

assylias
  • 321,522
  • 82
  • 660
  • 783
Olaf Dietsche
  • 72,253
  • 8
  • 102
  • 198