Java: Using the Soundex Algorithm for a huge Database

Question

I have been using a ready-made Java implementation of the Soundex algorithm: http://introcs.cs.princeton.edu/java/31datatype/Soundex.java.html . My program reads a .csv file, saves its entries into arrays, and then uses this algorithm to check one of these arrays for phonetic similarities. (More about the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex).

My .csv file has roughly 200,000 entries, and it takes 5 hours to check 30,000 of them, which I consider quite slow. [My algorithm checks every entry of the array against all the other entries, except the ones that have already been checked, so I don't think the problem is there.]

So, my question is: Is there a way to reduce this time?

I have been thinking about connecting my program directly to the database with SQL, but I don't know whether there is another, faster way to do this.

Any suggestion would be very helpful.

probably not a good fit for SO... but yes relational databases are pretty good at soundex set comparisons. certainly there are ways to hook your java code to your database - tons of ways. — Randy, Jan 10 '13 at 17:01
200,000 isn't a huge database. I would suspect your algorithm. You need to make sure every entry is only converted once, as checking each entry against all the others is O(N**2), even if you do it properly without redundant comparisons. — user207421, Jan 10 '13 at 20:52

score 1 · Accepted Answer · answered Jan 10 '13 at 17:07

I don't know how the Java algorithm works. A lot of databases include a soundex() function. This converts a string into another string representing the sound.

You can then do the comparison between the resulting soundex strings.

This should go much, much faster than your current approach. You would have to test it to see if it returns acceptable results.

Actually, I just looked at the Java code. You can take the same approach there. Go through the file and calculate the soundex for each entry once. Then do the comparison afterwards, perhaps by sorting the entries by their soundex code and looking for duplicates.
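
As a rough illustration of the approach described above, the following Java sketch computes each entry's code exactly once and then groups entries by code, so matches fall out of a single pass instead of pairwise comparisons. The class name `SoundexGrouping`, the `groupByCode` helper, and the sample names are hypothetical, and the `soundex` method is a simplified variant (it treats H, W and Y like vowels), so it is not guaranteed to agree with the Princeton implementation or with a database's `soundex()` on every input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SoundexGrouping {

    // Digit for each letter A..Z. '0' marks letters that produce no digit
    // (vowels, plus H, W and Y in this simplified variant).
    private static final String CODES = "01230120022455012623010202";

    // Simplified Soundex: first letter, then up to three digits, padded with zeros.
    static String soundex(String s) {
        String name = s.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = name.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
            } else if (code != '0' && code != prev) {
                out.append(code); // skip uncoded letters and adjacent repeats
            }
            prev = code;
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    // One soundex call per entry; entries with the same code land in the same list.
    static Map<String, List<String>> groupByCode(List<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            groups.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for one column of the .csv file.
        List<String> names = Arrays.asList("Robert", "Rupert", "Smith", "Smyth", "Rubin");
        for (Map.Entry<String, List<String>> e : groupByCode(names).entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}
```

Entries that share a key, such as Robert and Rupert (both R163), are the phonetic matches. With one code computation per entry, the work grows roughly in proportion to the number of entries (plus the cost of maintaining the sorted map) rather than with the number of pairs.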

Hey Gordon. You are quite right about this ;) I don't know what I was thinking when I used the algorithm inside a function. It improved the process a lot. I will also try using the algorithm in the database, but I will accept your answer. — Dimitra Micha, Jan 11 '13 at 08:41

score 0 · Answer 2 · answered Jan 10 '13 at 19:45

Just use the soundex implementation in your database. Most large popular databases have it built in, e.g. PostgreSQL, MySQL or even Microsoft's T-SQL. It'll be easier to set up and likely a lot faster than whatever Java library you're using.

Cerin

Thank you Cerin, I will have to try that too. I believe that it will make it a lot faster as well. I will accept Gordon's answer, since it really improved the whole procedure in Java without using something else. – Dimitra Micha Jan 11 '13 at 08:44
