
It's about the libpuzzle library for PHP ( http://libpuzzle.pureftpd.org/project/libpuzzle ) by Mr. Frank Denis. I'm trying to understand how to index and store the data in my MySQL database. Generating the vector is absolutely no problem.

Example:

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
  echo "Pictures are looking similar\n";
} else {
  echo "Pictures are different, distance=$d\n";
}

That's all clear to me - but now, how do I proceed when I have a large number of pictures, say more than 1,000,000? Do I calculate the vector and store it with the filename in the database? How do I then find the similar pictures? If I store every vector in MySQL, I have to open each record and calculate the distance with the puzzle_vector_normalized_distance function. That procedure takes a lot of time (open each database entry, put it through the function, ...).

I read the README of the libpuzzle library and found the following:

Will it work with a database that has millions of pictures?

A typical image signature only requires 182 bytes, using the built-in compression/decompression functions.

Similar signatures share identical “words”, ie. identical sequences of values at the same positions. By using compound indexes (word + position), the set of possible similar vectors is dramatically reduced, and in most cases, no vector distance actually requires to get computed.

Indexing through words and positions also makes it easy to split the data into multiple tables and servers.

So yes, the Puzzle library is certainely not incompatible with projects that need to index millions of pictures.

Also i found this description about indexing:

------------------------ INDEXING ------------------------

How to quickly find similar pictures, if they are millions of records?

The original paper has a simple, yet efficient answer.

Cut the vector in fixed-length words. For instance, let's consider the following vector:

[ a b c d e f g h i j k l m n o p q r s t u v w x y z ]

With a word length (K) of 10, you can get the following words:

[ a b c d e f g h i j ] found at position 0
[ b c d e f g h i j k ] found at position 1
[ c d e f g h i j k l ] found at position 2
etc. until position N-1

Then, index your vector with a compound index of (word + position).

Even with millions of images, K = 10 and N = 100 should be enough to have very little entries sharing the same index.

Here's a very basic sample database schema:

+-----------------------------+
|         signatures          |
+--------+-----------+--------+
| sig_id | signature | pic_id |
+--------+-----------+--------+

+--------------------------+
|          words           |
+--------------+-----------+
| pos_and_word | fk_sig_id |
+--------------+-----------+

I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambdas=9), signatures are 544 bytes long. In order to save storage space, they can be compressed to about one third of their original size with the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

I think that compressing is the wrong way, because then I have to uncompress every vector before comparing it.

My question now is: what is the way to handle millions of pictures, and how do I compare them in a fast and efficient way? I can't understand how the "cutting of the vector" should help me with my problem.

Many thanks - maybe I can find someone here who is working with the libpuzzle library.

Cheers.

barryhunter
phpman

4 Answers


So, let's take a look at the example they give and try to expand.

Let's assume you have a table that stores information relating to each image (path, name, description, etc). In that table, you'll include a field for the compressed signature, calculated and stored when you initially populate the database. Let's define that table thus:

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you initially compute the signature, you're also going to compute a number of words from the signature:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

Now you can put those words into a table, defined thus:

CREATE TABLE img_sig_words (
    image_id INTEGER NOT NULL,
    sig_word TEXT NOT NULL,
    FOREIGN KEY (image_id) REFERENCES images (image_id),
    INDEX (image_id, sig_word)
);

Now you insert into that table, prepending the position index of where the word was found, so that you know when a word matches that it matched in the same place in the signature:

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
foreach ($words as $index => $word) {
    $sig_word = $index.'__'.$word;
    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id,
        '$sig_word')"); // figure a suitably defined db abstraction layer...
}
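If a PDO connection is available, the same loop can be written with a prepared statement so that values are never interpolated into the SQL. This is just a sketch: an in-memory SQLite handle stands in for the real MySQL connection, and the sample `$words` are made up.

```php
// Sketch only: SQLite in-memory stands in for the MySQL connection here;
// swap the DSN for your real MySQL credentials in practice.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE img_sig_words (image_id INTEGER NOT NULL, sig_word TEXT NOT NULL)');

$image_id = 1;                          // primary key from the images insert
$words = ['abcdefghij', 'bcdefghijk'];  // would come from the word loop above

$stmt = $pdo->prepare('INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?)');
foreach ($words as $index => $word) {
    // prepend the position index, exactly as in the plain-query version
    $stmt->execute([$image_id, $index . '__' . $word]);
}
```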

Your data thus initialized, you can grab images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
        AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    ORDER BY strength DESC");

You could improve the query by adding a HAVING clause that requires a minimum strength, thus further reducing your matching set.
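As a sketch, such a cutoff might look like this. The threshold of 20 is a placeholder to tune against your own data, and `$image_id` is hard-coded purely for illustration:

```php
$image_id = 42;      // base image id (placeholder)
$min_strength = 20;  // hypothetical minimum number of matching words

$sql = "SELECT i.*, COUNT(isw.sig_word) AS strength
    FROM images i
    JOIN img_sig_words isw ON i.image_id = isw.image_id
    JOIN img_sig_words isw_search ON isw.sig_word = isw_search.sig_word
        AND isw.image_id != isw_search.image_id
    WHERE isw_search.image_id = $image_id
    GROUP BY i.image_id, i.name, i.description, i.file_path, i.url_path, i.signature
    HAVING strength >= $min_strength
    ORDER BY strength DESC";
// $dbobj->query($sql); as before
```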

I make no guarantees that this is the most efficient setup, but it should be roughly functional to accomplish what you're looking for.

Basically, splitting and storing the words in this manner allows you to do a rough distance check without having to run a specialized function on the signatures.

Jason
  • That's good information - thanks. Just to clarify, have you actually tried this - or is it only 'in theory'? Won't affect the bounty, but definitely interested in seeing a working implementation. In particular, it seems like your indexes might need tweaking to run efficient queries. – barryhunter Mar 21 '12 at 14:28
  • It's theory, I've got no direct experience with libpuzzle, I just figured I'd provide some code to expand on the examples from the libpuzzle documentation mostly as an exercise. – Jason Mar 21 '12 at 15:20
  • quick note... we actually implemented the (slightly modified) above... works like a charm! and... lo and behold, a bit more accurate than running the puzzle compare functions image vs image... so far we have experimented with a strength of 20... and pretty much are getting 100% accurate results for our 4 million strong image base... thanks!!! – anonymous-one Mar 25 '13 at 11:50
  • I've tried your code, but it returned a $words array with 100 items, each with no value?! – TomSawyer Oct 15 '13 at 03:35
  • I'd like to help, but I'm not entirely clear what you're saying. Is it that the `$words` array holds 100 entries, but each one is a blank string? If so, it's possible I made a typo, but it's been a year and a half since I've looked at this and don't remember all of the details... At any rate, try `$words[] = substr($cvec, $i, $wordlen);` and see if that sets you right (if not, I obviously need to look closer to understand what I wrote) – Jason Oct 15 '13 at 21:09

I've experimented with libpuzzle before - got about as far as you. Didn't really start on a proper implementation, and was also unclear how exactly to do it (and abandoned the project for lack of time, so didn't really persist with it).

Anyway, looking at it now, I'll try to offer my understanding - maybe between us we can work it out :)

Queries use a 2-stage process -

  1. First you use the words table.
    1. Take the 'reference' image and work out its signature.
    2. Work out its component words.
    3. Consult the words table to find all the possible matches. This can use the database engine's indexes for efficient queries.
    4. Compile a list of all sig_ids (you will get some duplicates from step 3).
  2. Then consult the signatures table.
    1. Retrieve and decompress all the possible signatures (because you have a prefiltered list, the number should be relatively small).
    2. Use puzzle_vector_normalized_distance to work out an actual distance.
    3. Sort and rank the results as required.

(i.e. you only use compression on the signatures table. The words table remains uncompressed, so you can run fast queries on it.)
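A rough PHP sketch of those two stages, assuming a PDO connection over the signatures/words schema from the README. This is untested - the puzzle_* calls require the libpuzzle extension:

```php
// Sketch of the two-stage lookup. $db is assumed to be a connected PDO
// instance; the puzzle_* calls need the libpuzzle PHP extension.
function find_similar(PDO $db, string $ref_cvec, array $ref_words): array
{
    // Stage 1: candidate sig_ids via the indexed words table (deduplicated).
    $in = implode(',', array_fill(0, count($ref_words), '?'));
    $st = $db->prepare("SELECT DISTINCT fk_sig_id FROM words WHERE pos_and_word IN ($in)");
    $st->execute(array_values($ref_words));
    $candidates = $st->fetchAll(PDO::FETCH_COLUMN);

    // Stage 2: decompress only the candidates and compute real distances.
    $results = [];
    $sig = $db->prepare('SELECT signature FROM signatures WHERE sig_id = ?');
    foreach ($candidates as $sig_id) {
        $sig->execute([$sig_id]);
        $cvec = puzzle_uncompress_cvec($sig->fetchColumn());
        $d = puzzle_vector_normalized_distance($ref_cvec, $cvec);
        if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
            $results[$sig_id] = $d;
        }
    }
    asort($results); // closest matches first
    return $results;
}
```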

The words table is a form of inverted index. In fact, I have in mind to use https://stackoverflow.com/questions/tagged/sphinx instead of the words database table, because that is designed specifically as a very fast inverted index.

... in theory anyway...

barryhunter

I am also working with libpuzzle in PHP and am having some doubts about how to generate the words from the image signatures. Jason's answer above seems right, but I have a problem with this part:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

The signature vector is 544 letters long, and with the above creation of words we are always using only the first 110 or so letters of it. That means we are indexing based on the upper third of the image content, if I understand this correctly.

If you read the original article ("An Image Signature for any kind of Image"), on which libpuzzle is based, they explain that words should be generated "...possibly non-contiguous and overlapping". I am not sure if they mean non-contiguous and non-overlapping, or non-contiguous and overlapping...

But if they mean non-overlapping, I guess the words should be spread out across the entire signature vector. That would also make more sense, as the vector is created by evaluating regions of the image from left to right, top to bottom. Spreading words across the entire vector would mean you are considering the whole image rather than just the upper part of it (which is what happens if you generate all the words from the beginning of the vector).
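One possible reading - evenly spacing the N word positions over the whole vector instead of taking positions 0..N-1 - could be sketched like this (my interpretation, not necessarily the paper's exact scheme):

```php
// Sample N starting positions spread evenly over the full signature,
// instead of clustering them at the beginning.
function spread_words(string $signature, int $k, int $n): array
{
    $words = [];
    $max_start = strlen($signature) - $k; // last valid starting offset
    for ($i = 0; $i < $n; $i++) {
        $pos = (int) round($i * $max_start / ($n - 1)); // evenly spaced
        $words[$pos] = substr($signature, $pos, $k);
    }
    return $words;
}

// With a 544-letter signature, K = 10 and N = 100, the starting positions
// now run from 0 to 534 instead of stopping at 99.
$sig = str_repeat('abcdefgh', 68); // dummy 544-character stand-in
$words = spread_words($sig, 10, 100);
```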

Would love to hear how you guys understand this.

salamca
  • I'm not sure if anyone is still following this topic, but for anyone who might throw in an opinion... I've done some testing with the above mentioned approach – salamca Nov 04 '14 at 17:01
  • The thing is that with my first couple of test images, which are very similar I get a distance of 0.544, which means that they should be ruled as almost the same. But with the above procedure of generating words from signatures from both images none of their words overlap. So these two images would be ruled as not similar already in the first step which would be wrong! – salamca Nov 04 '14 at 17:06
  • Going back to the original article mentioned above, I read that for the "word step" to work in a statistical sense, they needed to lump together -1 with -2 and 1 with 2 in the word vector. I tried that and with k=10 and N=100 (taking words only from the beginning of the vector, which I am not sure is right), it barely works (one overlap on very similar images) Any thoughts on that would be greatly appreciated. – salamca Nov 04 '14 at 17:10
  • You should [ask another question](https://stackoverflow.com/questions/ask) instead of *answering* here. Don't hesitate to copy text from this post. Since it is not an answer, it doesn't fit in Q&A format, and its removing from here is only a matter of time. – user Nov 06 '14 at 21:09

I've made a libpuzzle DEMO project on GitHub: https://github.com/alsotang/libpuzzle_demo.

The project uses the approach Jason proposed above.

The database schema is at: https://github.com/alsotang/libpuzzle_demo/blob/master/schema.sql


And I will give more information about libpuzzle's signature.

[two example images]

Now we have the two images, and let me calculate their signature.

[signature dump for both images]

The odd lines are for image 1 (the left one), and the even lines are for image 2.

You can see that, in most cases, the numbers at the same positions are the same.
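A small helper makes that comparison concrete (illustrative only; the digit strings below are made-up stand-ins for real signature dumps):

```php
// Count how many positions two signature strings agree on.
function matching_positions(string $a, string $b): int
{
    $len = min(strlen($a), strlen($b));
    $matches = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($a[$i] === $b[$i]) {
            $matches++;
        }
    }
    return $matches;
}

echo matching_positions('1101120', '1101220'); // 6 of the 7 positions agree
```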

....


My English is not great, so I can't explain much further... I think anyone who wants to index millions of images should inspect my GitHub repo with the libpuzzle DEMO.

alsotang