My situation
Say I have thousands of objects, which in this example could be movies.
I parse these movies in a lot of different ways, collecting parameters, keywords and statistics about each of them. Let's call them keys. I also assign a weight to each key, ranging from 0 to 1, depending on frequency, relevance, strength, score and so on.
As an example, here are a few keys and weights for the movie Armageddon:
"Armageddon"
------------------
disaster 0.8
bruce willis 1.0
metascore 0.2
imdb score 0.4
asteroid 1.0
action 0.8
adventure 0.9
... ...
There could be a couple of thousand of these keys and weights, and for clarity, here's another movie:
"The Fast and the Furious"
------------------
disaster 0.1
bruce willis 0.0
metascore 0.5
imdb score 0.6
asteroid 0.0
action 0.9
adventure 0.6
... ...
I call this a fingerprint of a movie, and I want to use them to find similar movies within my database.
I also imagine it will be possible to insert something other than a movie, like an article or a Facebook profile, and assign a fingerprint to it if I wanted to. But that shouldn't affect my question.
My problem
So I have come this far, but now comes the part I find tricky. I want to take the fingerprint above and turn it into something easily comparable and fast. I tried creating an array where index 0 = disaster, index 1 = bruce willis, index 2 = metascore, and so on, with the weight as the value.
It comes out something like this for my two movies above:
[ 0.8 , 1.0 , 0.2 , ... ]
[ 0.1 , 0.0 , 0.5 , ... ]
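In code, the conversion is roughly this (simplified; the global key list and the per-movie dictionary stand in for my actual parsing output):
public double[] ToVector(Dictionary<string, double> fingerprint, IList<string> allKeys)
{
    // allKeys is the global, ordered list of every key seen across all movies.
    // Keys a movie doesn't have simply get weight 0.0.
    double[] vector = new double[allKeys.Count];
    for (int i = 0; i < allKeys.Count; i++)
    {
        double weight;
        vector[i] = fingerprint.TryGetValue(allKeys[i], out weight) ? weight : 0.0;
    }
    return vector;
}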
I have tried comparing these arrays in different ways, for example by just multiplying and summing:
public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;
    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // Sum the products of the matching weights (a dot product).
            result += f1[i] * f2[i];
        }
    }
    return result;
}
or by averaging how close each pair of weights is:
public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;
    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // Each term is 1 minus the absolute difference, averaged over all keys,
            // so identical fingerprints score 1.
            result += (1 - Math.Abs(f1[i] - f2[i])) / f1.Length;
        }
    }
    return result;
}
and so on.
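For reference, the first version is essentially an unnormalized dot product. The normalized variant is cosine similarity, which divides by the vector lengths so that movies with many strong keys don't automatically score higher; a minimal sketch:
public double CosineSimilarity(double[] f1, double[] f2)
{
    double dot = 0, norm1 = 0, norm2 = 0;
    for (int i = 0; i < f1.Length; i++)
    {
        dot += f1[i] * f2[i];
        norm1 += f1[i] * f1[i];
        norm2 += f2[i] * f2[i];
    }
    if (norm1 == 0 || norm2 == 0)
        return 0; // one of the fingerprints is all zeros
    // For non-negative weights this lands in [0, 1], with 1 meaning identical direction.
    return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
}
This doesn't solve the speed problem by itself, but it makes scores comparable across movies with different numbers of strong keys.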
These have returned very satisfying results, but they all share one problem: they work great for comparing two movies, but comparing a single movie's fingerprint against the thousands of fingerprints stored in my MSSQL database is quite time consuming and feels like very bad practice. Especially if it's supposed to power something like autocomplete, where I want results back in a fraction of a second.
My question
Do I have the right approach here, or am I reinventing the wheel in a really inefficient way? I hope my question isn't too broad for Stack Overflow, but I have narrowed it down with a few thoughts below.
A couple of thoughts
- Should my fingerprint really be an array of weights?
- Should I look into hashing my fingerprint? It might help with fingerprint storage, but it complicates comparison. I have found some hints that locality-sensitive hashing might be a valid approach, but the math is a bit out of my reach (a rough sketch of the random-hyperplane variant follows this list).
- Should I fetch all thousands of movies from SQL and work with the result, or is there a way to implement my comparison into an SQL query and only return the top 100 hits?
- Is sparse data representation something to look into? (Thanks Speed8ump)
- Could I apply methods used when comparing actual fingerprints or for OCR?
- I have heard that there is software that detects exam cheating by finding similarities in thousands of published papers and previous tests. What method do they use?
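Regarding the locality-sensitive hashing thought: from what I can tell, the random-hyperplane ("SimHash") variant boils down to reducing each fingerprint to a short bit signature, where fingerprints with a small cosine angle tend to share bits. Candidates could then be pre-filtered by Hamming distance before running the exact comparison. A rough sketch, with made-up numbers assumed for the dimension and bit count:
using System;

public class SimHasher
{
    private readonly double[][] hyperplanes; // one random hyperplane per signature bit

    public SimHasher(int dimensions, int bits, int seed = 42)
    {
        // bits must be <= 64 so the signature fits in a ulong.
        var rng = new Random(seed);
        hyperplanes = new double[bits][];
        for (int b = 0; b < bits; b++)
        {
            hyperplanes[b] = new double[dimensions];
            for (int d = 0; d < dimensions; d++)
                hyperplanes[b][d] = rng.NextDouble() * 2 - 1; // random component in [-1, 1]
        }
    }

    // Each bit records which side of a random hyperplane the fingerprint falls on.
    public ulong ComputeSignature(double[] fingerprint)
    {
        ulong signature = 0;
        for (int b = 0; b < hyperplanes.Length; b++)
        {
            double dot = 0;
            for (int d = 0; d < fingerprint.Length; d++)
                dot += fingerprint[d] * hyperplanes[b][d];
            if (dot >= 0)
                signature |= 1UL << b;
        }
        return signature;
    }

    // Cheap pre-filter: how many bits differ between two signatures.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0) { count++; x &= x - 1; }
        return count;
    }
}
If that holds up, the 64-bit signature could be stored as a BIGINT column next to each movie, so the coarse Hamming-distance filter runs in the database and only the surviving candidates need the full fingerprint comparison.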
Cheers!