We use libpuzzle ( http://www.pureftpd.org/project/libpuzzle/doc ) to compare 4 million images against each other for similarity.

It works quite well.

But rather than doing an image-vs-image compare using the libpuzzle functions, there is another method of comparing the images.

Here is some quick background:

Libpuzzle creates a rather small (544-byte) hash of any given image. This hash can in turn be compared against other hashes using libpuzzle's functions. There are APIs for PHP, C, and a few other languages; we are using the PHP API.
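
For reference, the straight image-vs-image compare with the PHP extension looks roughly like this (a minimal sketch; the file names are placeholders):

$cvec1 = puzzle_fill_cvec_from_file('image1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('image2.jpg');

// Normalized distance: 0.0 means identical, higher means less similar.
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

if ($d < PUZZLE_CVEC_SIMILARITY_THRESHOLD) {
    echo "Probably similar (distance: $d)\n";
}

// puzzle_compress_cvec($cvec1) gives the compact form for storage.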

The other method of comparing the images is by creating vectors from the given hash. Here is a paste from the docs:

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.
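
For illustration, a minimal PHP sketch of the word-cutting, using the alphabet example above (the "position_word" key format is just one way to build the compound index):

$vector = 'abcdefghijklmnopqrstuvwxyz';
$k = 10;                                    // word length K
for ($pos = 0; $pos <= strlen($vector) - $k; $pos++) {
    $word = substr($vector, $pos, $k);      // e.g. "abcdefghij" at position 0
    $index = $pos . '_' . $word;            // compound index of (word + position)
    echo $index . "\n";
}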

So, we have the vector method working. It actually works a bit better than the image-vs-image compare, since with the image-vs-image compare we use other data to reduce our sample size. Exactly what other data we use is application specific and a bit irrelevant here, but with the vector method we would not have to do so: we could do a real test of each of the 4 million hashes against every other one.

The issue we have is as follows:

With 4 million images and 100 vectors per image, this becomes 400 million rows. We have found MySQL tends to choke after about 60,000 images (60,000 x 100 = 6 million rows).

The query we use is as follows:

SELECT isw.itemid, COUNT(isw.word) as strength
FROM vectors isw
JOIN vectors isw_search ON isw.word = isw_search.word
WHERE isw_search.itemid = {ITEM ID TO COMPARE AGAINST ALL OTHER ENTRIES}
GROUP BY isw.itemid;

As mentioned, even with proper indexes, the above is quite slow when it comes to 400 million rows.

So, can anyone suggest any other technologies or algorithms to test these for similarity?

We are willing to give anything a shot.

Some things worth mentioning:

  1. Hashes are binary.
  2. Hashes are always the same length, 544 bytes.

The best we have been able to come up with is:

  1. Convert the image hash from binary to ASCII.
  2. Create vectors.
  3. Create a string as follows: VECTOR1 VECTOR2 VECTOR3 and so on.
  4. Search using Sphinx.

We have not yet tried the above, but it should probably yield somewhat better results than the MySQL query.
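
Roughly, we picture steps 1-3 looking like this (a sketch only; $hash is assumed to hold the 544-byte binary hash, and the word cutting follows the K = 10, N = 100 scheme above):

$hex = bin2hex($hash);                  // step 1: binary -> ASCII-safe hex
$words = array();
for ($pos = 0; $pos < 100; $pos++) {    // step 2: create N = 100 words
    $words[] = $pos . '_' . substr($hex, $pos, 10);
}
$document = implode(' ', $words);       // step 3: "0_... 1_... 2_... etc"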

Any ideas? As mentioned, we are willing to install any new service (PostgreSQL? Hadoop?).

Final note: an outline of exactly how this vector + compare method works can be found in the question Libpuzzle Indexing millions of pictures?. We are in essence using the exact method provided by Jason (currently the last answer, awarded 200+ points).

anonymous-one
    I had pretty good experience with minhash clustering and Mahout on Hadoop. Maybe you want to try that out. – Thomas Jungblut Mar 30 '13 at 09:42
  • Using a full text search engine should work very well. Only make sure to use the right ranking method. You probably don't want to use things like TF-IDF. – nwellnhof Mar 30 '13 at 10:36
  • The other question says 'similar sequences have the same value at the same position', which means you probably only need to store every 10th byte, and XOR this with the input image. If you get a few bytes with xor=0, then do a full compare. 4 million 55-byte short hashes would fit in 220 MB, and you could scan all of this sub-second on a good system. Depends how fast you want to go. Couldn't find any sample hashes to test this theory though... And didn't want to answer without testing for reliability – rlb Mar 30 '13 at 11:10

2 Answers


Don't do this in a database; just use a simple file. Below I have shown a file with some of the words from the two vectors [abcdefghijklmnopqrst] (image 1) and [xxcdefghijklxxxxxxxx] (image 2):

  <index>       <image>
0abcdefghij      1
1bcdefghijk      1
2cdefghijkl      1
3defghijklm      1
4efghijklmn      1
...
...
0xxcdefghij      2
1xcdefghijk      2
2cdefghijkl      2
3defghijklx      2
4efghijklxx      2
...

Now sort the file:

  <index>       <image>
0abcdefghij      1
0xxcdefghij      2
1bcdefghijk      1
1xcdefghijk      2
2cdefghijkl      1       
2cdefghijkl      2       <= the index is repeated, thus we have a match
3defghijklm      1
3defghijklx      2
4efghijklmn      1
4efghijklxx      2

When the file has been sorted, it's easy to find the records that have the same index. Write a small program or something that can run through the sorted list and find the duplicates.
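
A minimal sketch of such a scan in PHP, assuming the file has already been sorted (e.g. with the Unix sort command) and is named vectors_sorted.txt:

$fh = fopen('vectors_sorted.txt', 'r');
$prevIndex = $prevImage = null;
while (($line = fgets($fh)) !== false) {
    // Each line is "<index> <image>"; split on whitespace.
    list($index, $image) = preg_split('/\s+/', trim($line));
    if ($index === $prevIndex) {
        echo "match: image $prevImage ~ image $image on word $index\n";
    }
    $prevIndex = $index;
    $prevImage = $image;
}
fclose($fh);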

Ebbe M. Pedersen

I have opted to 'answer my own' question, as we have found a solution that works quite well.

In the initial question, I mentioned we were thinking of doing this via Sphinx search.

Well, we went ahead and did it, and the results are MUCH better than doing this via MySQL.

So, in essence, the process looks like this:

a) Generate the hash from the image.

b) 'Vectorize' this hash into 100 parts.

c) Binhex (binary to hex) each of these vectors, since they are in binary format.

d) Store in Sphinx search like so:

itemid | 0_vector0 1_vector1 2_vec... etc

e) Search using Sphinx search.
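
The search side, sketched with the stock sphinxapi PHP client (host, port and index name are placeholders for our setup):

require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY);   // match anything sharing at least one word
$cl->SetLimits(0, 50);              // keep the top 50 candidates

// $document is the "0_xxxx 1_xxxx ..." word string built for the probe image.
$result = $cl->Query($document, 'vectors_index');
if (!empty($result['matches'])) {
    foreach ($result['matches'] as $itemid => $match) {
        // With SPH_MATCH_ANY the weight grows with the number of shared words.
        echo "$itemid -> weight {$match['weight']}\n";
    }
}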

Initially, once we had this sphinxbase full of 4 million records, it would still take about 1 second per search.

We then enabled distributed indexing for this sphinxbase, across 8 cores, and can now run about 10+ searches per second. This is good enough for us.

One final step would be to further distribute this sphinxbase over the multiple servers we have, making use of the unused CPU cycles we have available.

But for the time being, it's good enough. We add about 1000-2000 'items' per day, so searching through 'just the new ones' will happen quite quickly after we do the initial scan.

anonymous-one