My situation
Say I have thousands of objects, which in this example could be movies.
I parse these movies in a lot of different ways, collecting parameters, keywords and statistics about each of them. Let's call them keys. I also assign a weight to each key, ranging from 0 to 1, depending on frequency, relevance, strength, score and so on.
As an example, here are a few keys and weights for the movie Armageddon:
"Armageddon"
------------------
disaster 0.8
bruce willis 1.0
metascore 0.2
imdb score 0.4
asteroid 1.0
action 0.8
adventure 0.9
... ...
There could be a couple of thousand of these keys and weights, and for clarity, here's another movie:
"The Fast and the Furious"
------------------
disaster 0.1
bruce willis 0.0
metascore 0.5
imdb score 0.6
asteroid 0.0
action 0.9
adventure 0.6
... ...
I call this a fingerprint of a movie, and I want to use them to find similar movies within my database.
I also imagine it will be possible to insert something other than a movie, like an article or a Facebook profile, and assign a fingerprint to it if I wanted to. But that shouldn't affect my question.
My problem
So I have come this far, but now comes the part I find tricky. I want to take the fingerprint above and turn it into something easily comparable and fast. I tried creating an array where index 0 = disaster, index 1 = bruce willis, index 2 = metascore, and so on, with the weight as the value.
It comes out something like this for my two movies above:
[ 0.8 , 1.0 , 0.2 , ... ]
[ 0.1 , 0.0 , 0.5 , ... ]
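In code, the conversion is roughly this (simplified; the global key list and the per-movie dictionary stand in for my actual parsing output):
public double[] ToVector(Dictionary<string, double> fingerprint, IList<string> allKeys)
{
    // allKeys is the global, ordered list of every key seen across all movies.
    // Keys a movie doesn't have simply get weight 0.0.
    double[] vector = new double[allKeys.Count];
    for (int i = 0; i < allKeys.Count; i++)
    {
        double weight;
        vector[i] = fingerprint.TryGetValue(allKeys[i], out weight) ? weight : 0.0;
    }
    return vector;
}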
I have tried comparing these arrays in different ways, for example by just multiplying and summing:
public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;
    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // Sum the products of the matching weights (a dot product).
            result += f1[i] * f2[i];
        }
    }
    return result;
}
or by averaging how close each pair of weights is:
public double CompareFingerprints(double[] f1, double[] f2)
{
    double result = 0;
    if (f1.Length == f2.Length)
    {
        for (int i = 0; i < f1.Length; i++)
        {
            // Each term is 1 minus the absolute difference, averaged over all keys,
            // so identical fingerprints score 1.
            result += (1 - Math.Abs(f1[i] - f2[i])) / f1.Length;
        }
    }
    return result;
}
and so on.
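For reference, the first version is essentially an unnormalized dot product. The normalized variant is cosine similarity, which divides by the vector lengths so that movies with many strong keys don't automatically score higher; a minimal sketch:
public double CosineSimilarity(double[] f1, double[] f2)
{
    double dot = 0, norm1 = 0, norm2 = 0;
    for (int i = 0; i < f1.Length; i++)
    {
        dot += f1[i] * f2[i];
        norm1 += f1[i] * f1[i];
        norm2 += f2[i] * f2[i];
    }
    if (norm1 == 0 || norm2 == 0)
        return 0; // one of the fingerprints is all zeros
    // For non-negative weights this lands in [0, 1], with 1 meaning identical direction.
    return dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));
}
This doesn't solve the speed problem by itself, but it makes scores comparable across movies with different numbers of strong keys.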
These have returned very satisfying results, but they all share one problem: they work great for comparing two movies, but comparing a single movie's fingerprint against the thousands of fingerprints stored in my MSSQL database is quite time consuming and feels like very bad practice. Especially if it's supposed to power something like autocomplete, where I want results back in a fraction of a second.
My question
Do I have the right approach here, or am I reinventing the wheel in a really inefficient way? I hope my question isn't too broad for Stack Overflow, but I have narrowed it down with a few thoughts below.
A couple of thoughts
- Should my fingerprint really be an array of weights?
- Should I look into hashing my fingerprint? It might help with fingerprint storage, but it complicates comparison. I have found some hints that locality-sensitive hashing might be a valid approach, but the math is a bit out of my reach (a rough sketch of the random-hyperplane variant follows this list).
- Should I fetch all thousands of movies from SQL and work with the result, or is there a way to implement my comparison into an SQL query and only return the top 100 hits?
- Is sparse data representation something to look into? (Thanks Speed8ump)
- Could I apply methods used when comparing actual fingerprints or for OCR?
- I have heard that there is software that detects exam cheating by finding similarities in thousands of published papers and previous tests. What method do they use?
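Regarding the locality-sensitive hashing thought: from what I can tell, the random-hyperplane ("SimHash") variant boils down to reducing each fingerprint to a short bit signature, where fingerprints with a small cosine angle tend to share bits. Candidates could then be pre-filtered by Hamming distance before running the exact comparison. A rough sketch, with made-up numbers assumed for the dimension and bit count:
using System;

public class SimHasher
{
    private readonly double[][] hyperplanes; // one random hyperplane per signature bit

    public SimHasher(int dimensions, int bits, int seed = 42)
    {
        // bits must be <= 64 so the signature fits in a ulong.
        var rng = new Random(seed);
        hyperplanes = new double[bits][];
        for (int b = 0; b < bits; b++)
        {
            hyperplanes[b] = new double[dimensions];
            for (int d = 0; d < dimensions; d++)
                hyperplanes[b][d] = rng.NextDouble() * 2 - 1; // random component in [-1, 1]
        }
    }

    // Each bit records which side of a random hyperplane the fingerprint falls on.
    public ulong ComputeSignature(double[] fingerprint)
    {
        ulong signature = 0;
        for (int b = 0; b < hyperplanes.Length; b++)
        {
            double dot = 0;
            for (int d = 0; d < fingerprint.Length; d++)
                dot += fingerprint[d] * hyperplanes[b][d];
            if (dot >= 0)
                signature |= 1UL << b;
        }
        return signature;
    }

    // Cheap pre-filter: how many bits differ between two signatures.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong x = a ^ b;
        int count = 0;
        while (x != 0) { count++; x &= x - 1; }
        return count;
    }
}
If that holds up, the 64-bit signature could be stored as a BIGINT column next to each movie, so the coarse Hamming-distance filter runs in the database and only the surviving candidates need the full fingerprint comparison.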
Cheers!