How do I store data with a query that's a approximated?

Question

I'm trying to find a way to store my data with fast access (better than O(n)).

My database consists of data (4096 byte strings) that represents some information about some items.
The problem is, that the query is never exact. I get one Item, and then need to find the closest match using a function F(a,b).

just an example:

1234
3456
6466
F(a,b) = return % of similar digits  

GetClosest(1233,F) = 1234

The problem is that F(a,b) is a complicated algorithm, (not a proper metric).

What I have now is just go over the whole database to search for the best match.
Is there a kind of tree or other cluster database type that can give me faster finding complexity ?

More information:

F gives back a similarity value in %percentage. where 100% is a perfect match.

Is it possible to re-arrange/ store data wrt indices before the actual retrieval process? — NirmalGeo, May 10 '11 at 13:15

score 1 · Accepted Answer · answered May 10 '11 at 14:44

Sorry, the answer is "probably not" unless there is some more structure to your problem that you haven't described. With 4096 byte strings you're suffering from the curse of dimensionality.

If you had shorter strings and enough data that there was a high likelihood of the nearest match being identical over a large chunk of the string, then you could store your data with multiple tree-like structures indexed over different chunks of the string. With high likelihood the nearest would be close enough that you could prove it was nearest based only on close elements in those trees. However with the size of your strings and the limited data that can be stored in a computer, there is no way this is possibly going to work.

That said, do you need the exact closest, or only a somewhat close one? If only a likely close one, then you could index it by several random sparse samples of bits. In your search you can only check elements that match exactly in one of the elements. This will greatly reduce the search space, while rejecting fewer of the close neighbors, and may produce reasonable (even though frequently wrong) answers.

score 0 · Answer 2 · answered May 10 '11 at 09:34

0

Is there some way you could assign a 'score' to each datum.

You could index/sequence the data by your score.

When you search you assign a score to your search criteria, and look for the item with the closest score.

Depends very much on your data and your definition of "difference" whether this will work.

answered May 10 '11 at 09:34

James Anderson

27,109
7
50
78

I can't score them. It's not transitive, it's a similarity score. If I score the whole database according to similarity to A, it won't help with finding the similarity to B. – Yochai Timmer May 10 '11 at 09:39
Ah well I did say it depended on your data. Maybe someone could suggest a solution involving trees or some variation on a Bayesian algorithm. – James Anderson May 10 '11 at 09:46

How do I store data with a query that's a approximated?

2 Answers2