Database indexing and lookup with "closest neighbour" not exact match

Question

I'm dealing with an interesting issue.

I have a biometric system that uses John Daugman's algorithm to transform human irises into binary codes (for some research at our university).

The iris code is "flat" (it's not stored as a circle, but transformed into a rectangle):

column 1 | column 2 | column 3 | ...

10011001 ...
10110111
01100010
...

Each column represents 30 bits. The problem is that each iris scan has its own noise mask (eyelids, reflections...), so matches are never 100% but at best around 96-98%.

So far we are using an algorithm like this (Hamming distance matching):

mask = mask1 & mask2;
result = (code1 ^ code2) & mask;

// ratio of differing bits among the bits allowed by the combined mask
double difference = (double)one_bits(result)/one_bits(mask);
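
The comparison above can be sketched in runnable form. This is a minimal Python version, assuming each code and mask is held as a single arbitrary-precision integer and that a mask bit of 1 marks a usable position. The name `masked_hamming` and the choice to return 1.0 when no bits are comparable are illustrative assumptions, not part of the original system.

```python
def one_bits(x: int) -> int:
    """Count the 1 bits in a non-negative integer."""
    return bin(x).count("1")


def masked_hamming(code1: int, mask1: int, code2: int, mask2: int) -> float:
    """Fraction of differing bits among positions that both masks mark as usable."""
    mask = mask1 & mask2                # positions usable in both scans
    valid = one_bits(mask)
    if valid == 0:
        return 1.0                      # assumption: nothing comparable counts as a non-match
    result = (code1 ^ code2) & mask     # differing bits, restricted to usable positions
    return one_bits(result) / valid
```

Guarding against an empty combined mask avoids a division by zero, which the original one-liner would hit if two scans had no usable bits in common.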

The problem is that we are now building a real database of irises (around 1200-1300 subjects, each with 3-5 iris samples, and you have to account for rotation, so you need around 10 tests for each sample). We need to compare the current sample against the whole database (about 65 000 comparisons on 80*30 bits), which turns out to be slow.
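
To make the cost concrete, here is a hedged sketch of the brute-force search described above: roughly 1300 subjects x 5 samples x 10 rotations gives the 65 000 comparisons mentioned. It assumes the 80 columns are packed one after another into a single 2400-bit integer, so that rotating the iris by one column is a cyclic shift by 30 bits. The packing, the `(label, code, mask)` database layout and the names `rotate` and `best_match` are all assumptions made for illustration.

```python
COLUMNS = 80                 # from the question
BITS_PER_COLUMN = 30         # from the question
TOTAL_BITS = COLUMNS * BITS_PER_COLUMN   # 2400
FULL = (1 << TOTAL_BITS) - 1


def rotate(x: int, columns: int) -> int:
    """Cyclically shift a 2400-bit value by a whole number of columns."""
    shift = (columns * BITS_PER_COLUMN) % TOTAL_BITS
    return ((x << shift) | (x >> (TOTAL_BITS - shift))) & FULL


def distance(code1: int, mask1: int, code2: int, mask2: int) -> float:
    """Masked Hamming distance as a fraction of the jointly usable bits."""
    mask = mask1 & mask2
    valid = bin(mask).count("1")
    if valid == 0:
        return 1.0           # assumption: nothing comparable counts as a non-match
    return bin((code1 ^ code2) & mask).count("1") / valid


def best_match(probe_code, probe_mask, database, max_shift=5):
    """Return (distance, label) of the closest stored sample over all tested rotations.

    `database` is an iterable of (label, code, mask) tuples (hypothetical layout).
    """
    best = (1.0, None)
    for shift in range(-max_shift, max_shift + 1):
        # Rotate the probe once per shift instead of rotating every stored sample.
        code = rotate(probe_code, shift)
        mask = rotate(probe_mask, shift)
        for label, db_code, db_mask in database:
            d = distance(code, mask, db_code, db_mask)
            if d < best[0]:
                best = (d, label)
    return best
```

The work still grows linearly with the number of stored samples times the number of rotations, which is exactly what an index is meant to avoid.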

Question: are there any hash functions that reflect the structure of the data (so the hash changes only slightly when a few bits change), or that are "error tolerant"? We need to build a fast search algorithm over the whole database, so we are looking for possible ways to index it.

UPDATE: I guess it should be implemented as some sort of "closest neighbour" lookup, or with some sort of clustering (where similar irises are grouped and only some representatives are checked in the first round).

Hash functions are generally intended to produce drastically different result even for slight changes in input, so you're not likely to find a suitable hash function. — lanzz, Oct 28 '12 at 11:43
@lanzz I know that... It even took me a week to find out that what I'm looking for is called ~"closest neighbor matching"... The terminology here is quite new for me :-S — Vyktor, Oct 28 '12 at 12:02
If I understand correct, each iris is represented using 80*30 bits (80 columns, 30 bits per column). And two irises (`a` and `b`) are considered match if the hamming distance between `a` and `b` is less than 2400bits*(1-96%) = 96bits? Is there any dependency between the columns? — greeness, Oct 31 '12 at 19:02
@greeness not that I know about... never actually tested the difference localization within image — Vyktor, Nov 01 '12 at 17:31
Is what I understand in my previous comment correct (about the bits and hamming distance and match criteria)? If it is, you can certainly use locality sensitive hashing. — greeness, Nov 01 '12 at 18:34

score 5 · Accepted Answer · edited May 23 '17 at 12:00

Look into Locality Sensitive Hashing (LSH) and implementations like this:

"A nilsimsa code is something like a hash, but unlike hashes, a small change in the message results in a small change in the nilsimsa code. Such a function is called a locality-sensitive hash."

See also: How to understand Locality Sensitive Hashing?
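
As a rough illustration of how LSH could apply to fixed-length binary codes, the sketch below uses bit sampling, a standard LSH family for Hamming distance. Note that this is a different scheme from nilsimsa, which targets text. Each table hashes a code by a fixed random subset of its bits, so two codes that differ in only about 4% of their bits agree on a 16-bit key with probability around 0.96^16, roughly 0.5, and several tables make a miss unlikely. The class name, the parameter values and the decision to ignore the noise mask when bucketing are all assumptions; every candidate would still need to be verified with the full masked Hamming comparison, and rotations would need one query per tested shift.

```python
import random
from collections import defaultdict

TOTAL_BITS = 80 * 30   # 2400 bits per iris code, as in the question


class BitSamplingLSH:
    """Bit-sampling LSH for Hamming distance (illustrative, not tuned)."""

    def __init__(self, num_tables=10, bits_per_key=16, seed=0):
        rng = random.Random(seed)
        # Each table looks at its own fixed random subset of bit positions.
        self.positions = [rng.sample(range(TOTAL_BITS), bits_per_key)
                          for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, code, positions):
        # Pack the sampled bits into a small integer used as the bucket key.
        key = 0
        for p in positions:
            key = (key << 1) | ((code >> p) & 1)
        return key

    def add(self, label, code):
        """Store `label` in one bucket per table."""
        for positions, table in zip(self.positions, self.tables):
            table[self._key(code, positions)].append(label)

    def candidates(self, code):
        """Labels sharing at least one bucket with `code`; verify these exactly."""
        found = set()
        for positions, table in zip(self.positions, self.tables):
            found.update(table.get(self._key(code, positions), ()))
        return found
```

A lookup then touches only a handful of buckets instead of every stored sample. Raising `bits_per_key` shrinks the candidate set, and raising `num_tables` lowers the chance of missing a true match.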

answered Oct 31 '12 at 17:53 by greeness · edited May 23 '17 at 12:00 by Community

if you are going to create tag wikis that contain text copied verbatim from a source, please include a link at the end of the tag wiki to the source article. – LittleBobbyTables - Au Revoir Oct 31 '12 at 18:22
