So here is the scenario. I have a very large, unsorted array called gallery, which contains pairs of templates (std::vector<uint8_t>) and their associated IDs (std::string).
I have a function that is given a template and must return the IDs of the k most similar templates in my gallery (I am using cosine similarity to generate a similarity score between the templates).
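For readers unfamiliar with the scoring step, here is a minimal sketch of cosine similarity. It assumes the templates have already been deserialized into equal-length float vectors; the name cosineSimilarity and that representation are illustrative assumptions, not the actual computeSimilarityScore used in the application.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original code): cosine similarity of two
// equal-length feature vectors, i.e. dot(a, b) / (|a| * |b|).
double cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        normA += static_cast<double>(a[i]) * a[i];
        normB += static_cast<double>(b[i]) * b[i];
    }
    // Cosine similarity is undefined for a zero vector; this sketch returns 0.
    if (normA == 0.0 || normB == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

The result lies in [-1, 1], with higher values meaning more similar templates.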
I considered using a heap as discussed in this post.
However, the issue is that gallery can contain multiple different templates that belong to a single ID. In my function, I must return k unique IDs.
For context, I am building a facial recognition application. I can have multiple different templates belonging to a single person in my gallery (the person was enrolled in the gallery using multiple different images, hence multiple templates belonging to their ID). The search function should return the k most similar people to the provided template (and thus not return the same ID more than once).
Would appreciate an efficient algorithm for doing so in C++.
Edit: Code snippet for my proposed solution with a heap (it does not deal with duplicate IDs properly):
// Min-heap on similarity: top() is the lowest score currently kept.
std::priority_queue<std::pair<double, std::string>, std::vector<std::pair<double, std::string>>, std::greater<>> queue;
for (const auto& templPair : m_gallery) {
    try {
        double similarity = computeSimilarityScore(templPair.templ, idTemplateDeserial);
        if (queue.size() < candidateListLength) {
            // Heap not full yet: keep every candidate.
            queue.push(std::pair<double, std::string>(similarity, templPair.id));
        } else if (queue.top().first < similarity) {
            // Better than the current worst: replace it.
            queue.pop();
            queue.push(std::pair<double, std::string>(similarity, templPair.id));
        }
    } catch (...) {
        std::cout << "Unable to compute similarity\n";
        continue;
    }
}
// The candidateListLength entries with the highest scores are now in queue
// (the same ID may appear more than once).
Here is an example to hopefully help. For the sake of simplicity, I will assume that the similarity score has already been computed for the templates.
Template 1: similarity score: 0.4, ID: Cyrus
Template 2: similarity score: 0.5, ID: James
Template 3: similarity score: 0.9, ID: Bob
Template 4: similarity score: 0.8, ID: Cyrus
Template 5: similarity score: 0.7, ID: Vanessa
Template 6: similarity score: 0.3, ID: Ariana
Getting the top 3 unique IDs by score should return [Bob, Cyrus, Vanessa].
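One possible approach, sketched under assumptions rather than offered as the definitive answer: first reduce the scores to the best score per ID with a hash map, then pick the k best IDs with std::partial_sort. The ScoredTemplate struct and the topKUniqueIds name are hypothetical and introduced only for illustration; the sketch takes precomputed scores, as in the example above, and needs C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical input type: one gallery entry whose score is already computed.
struct ScoredTemplate {
    double similarity;
    std::string id;
};

// Returns up to k distinct IDs, ordered by each ID's best score (highest first).
std::vector<std::string> topKUniqueIds(const std::vector<ScoredTemplate>& scored,
                                       std::size_t k) {
    // Pass 1: keep only the best score seen for each ID.
    std::unordered_map<std::string, double> bestById;
    for (const auto& entry : scored) {
        auto [it, inserted] = bestById.try_emplace(entry.id, entry.similarity);
        if (!inserted && entry.similarity > it->second) {
            it->second = entry.similarity;
        }
    }

    // Pass 2: move the k highest-scoring (score, ID) pairs to the front.
    std::vector<std::pair<double, std::string>> ranked;
    ranked.reserve(bestById.size());
    for (const auto& [id, score] : bestById) {
        ranked.emplace_back(score, id);
    }
    const std::size_t count = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(),
                      ranked.begin() + static_cast<std::ptrdiff_t>(count),
                      ranked.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });

    std::vector<std::string> ids;
    ids.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        ids.push_back(ranked[i].second);
    }
    return ids;
}
```

With the six example scores and k = 3, this returns Bob, Cyrus, Vanessa, with Cyrus ranked by his best score (0.8). The trade-off is memory: the map holds one entry per distinct ID rather than only k entries, which may or may not be acceptable depending on how many people are enrolled.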