Finding the K UNIQUE largest elements in an unsorted array of pairs

Question

So here is the scenario. I have an unsorted array (very large) called gallery which contains pairs of templates (std::vector<uint8_t>) and their associated IDs (std::string).

I have a function in which I am provided with a template and must return the IDs of the k most similar templates in my gallery (I am using cosine similarity to generate a similarity score between the templates).

I considered using a heap as discussed in this post. However, the issue is that gallery can contain multiple different templates which belong to a single ID. In my function, I must return k unique IDs.

For context, I am doing a facial recognition application. I can have multiple different templates belonging to a single person in my gallery (the person was enrolled in the gallery using multiple different images, hence multiple templates belonging to their ID). The search function should return the k most similar people to the provided template (and thus not return the same ID more than once).

Would appreciate an efficient algorithm for doing so in C++.

Edit: Code snippet for my proposed solution with a heap (does not deal with duplicates properly)

    std::priority_queue<std::pair<double, std::string>, std::vector<std::pair<double, std::string> >, std::greater<> > queue;


    for(const auto& templPair : m_gallery) {
        try{
            double similarity = computeSimilarityScore(templPair.templ, idTemplateDeserial);

            if (queue.size() < candidateListLength) {
                queue.push(std::pair<double, std::string>(similarity, templPair.id));
            } else if (queue.top().first < similarity) {
                queue.pop();
                queue.push(std::pair<double, std::string>(similarity, templPair.id));
            }
            }
        } catch(...) {
            std::cout << "Unable to compute similarity\n";
            continue;
        }
    }
    // candidateListLength IDs with the highest scores will be in queue

Here is an example to hopefully help. For the sake of simplicity, I will assume that the similarity score has already been computed for the templates.

Template 1: similarity score: 0.4, ID: Cyrus

Template 2: similarity score: 0.5, ID: James

Template 3: similarity score: 0.9, ID: Bob

Template 4: similarity score: 0.8, ID: Cyrus

Template 5: similarity score: 0.7, ID: Vanessa

Template 6: similarity score: 0.3, ID: Ariana

Getting the IDs of the top 3 scoring templates will return [Bob, Cyrus, Vanessa]

Use a max-heap and instead of discarding top IDs, put them into `std::set` and continue until your set's `size()` is `k`? — Fureeish, Sep 13 '19 at 23:42
So if I put the ID in a set, it will tell me if the ID is already in the max heap which is good. However, I would also need to modify the score value for that given ID in the queue (granted the new similarity score is greater than the similarity score already in the queue). — cyrusbehr, Sep 13 '19 at 23:46
I don't quite understand. You said in your question that you have pairs of a *value* and an *ID*. You have a function that describes similarity between two *values*, which can be used to order the elements. You say that you want to retrieve `k` unique *ID*s that correspond to the most similar *value*s. Where did you mention that you then would need to alter some data? Regardless, instead of an `std::set`, you may use `std::map` with *ID*s as **keys** and pointers to your pairs as **values**, but that assumes that I understood you correctly. Can you please provide a sample input and output? — Fureeish, Sep 13 '19 at 23:51
Sorry in my response I meant to modify the score value in the **max-heap**, not queue. I will add an edit to my question to add the code in question. — cyrusbehr, Sep 13 '19 at 23:58
Unfortunately, the code you've shown does not shine much light on the duplicates issue. Once again, please provide a sample input and output. You say that you want `k` unique IDs. Does that mean that ultimately you can be interested in any number (`> k`) of templates that have combined exactly `k` unique IDs? — Fureeish, Sep 14 '19 at 00:09
Great, last question - what if `Template 1` had `similarity score: 0.85`? How would that change the output? — Fureeish, Sep 14 '19 at 00:18
Output would still be the same in that case [Bob, Cyrus, Vanessa] (in that order) — cyrusbehr, Sep 14 '19 at 00:21
Then I believe my first comment proposes a correct solution. If nobody will give you a satisfying answer in ~24h, I will try to come up with my own. — Fureeish, Sep 14 '19 at 00:32
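
A minimal sketch of the approach suggested in the first comment, assuming the similarity scores have already been computed into (score, ID) pairs. The function name `topKUniqueIdsByHeap` and its parameters are hypothetical and do not appear in the question.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to k distinct IDs, best score first.
std::vector<std::string> topKUniqueIdsByHeap(
        const std::vector<std::pair<double, std::string>>& scored,
        std::size_t k) {
    // std::priority_queue is a max-heap by default, so top() is the
    // pair with the highest score.
    std::priority_queue<std::pair<double, std::string>> heap(
        scored.begin(), scored.end());

    std::unordered_set<std::string> seen;  // IDs already returned
    std::vector<std::string> result;
    while (!heap.empty() && result.size() < k) {
        const std::string id = heap.top().second;  // copy before pop()
        heap.pop();
        // insert().second is true only the first time an ID is seen.
        if (seen.insert(id).second) {
            result.push_back(id);
        }
    }
    return result;
}
```

For the sample data in the question with k = 3, this yields Bob, Cyrus, Vanessa. Unlike the bounded heap in the question, this sketch keeps every score in memory at once, which may matter for a very large gallery.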

Maras · Answer 1 · 2019-09-14T00:33:35.913

Use a std::set (a balanced BST) instead of a heap. It also keeps its elements in order and lets you find the largest and the smallest element inserted. In addition, insert automatically detects a duplicate and ignores it, so each element inside will always be unique. The asymptotic complexity is the same (logarithmic per insertion), though it is a bit slower in practice because of a larger constant factor.

Edit: I probably did not understand the question properly. As far as I can see you can have multiple elements with different values which should be considered a duplicate.
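
To illustrate why a plain set of pairs is not enough here: std::set compares whole (value, ID) pairs, so two templates of the same person with different scores count as distinct elements. A small sketch (the function name `countAfterInsertingSameId` is made up for illustration):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical helper: inserts templates of one person and reports
// how many elements the set ends up holding.
std::size_t countAfterInsertingSameId() {
    std::set<std::pair<double, std::string>> s;
    s.emplace(0.4, "Cyrus");
    s.emplace(0.8, "Cyrus");  // same ID, different score: a new element
    s.emplace(0.8, "Cyrus");  // exact duplicate of the pair: ignored
    return s.size();          // 2, so the ID "Cyrus" appears twice
}
```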

What I would do:

Make a set of pairs (template value, ID).
Make a map where the key is an ID and the value is the template value of the template currently in the set for that ID.
If you want to add a new template:
- If its ID is in the map, you have found a duplicate. If it has a worse value than the one paired with the ID in the map, do nothing; otherwise delete the pair (old value, ID) from the set, insert (new value, ID), and change the value in the map to the new one.
- If it's not in the map, just add it to both the map and the set.
When you have too many items in the set, just delete the worst one from both the set and the map.

cyrusbehr · Accepted Answer · 2019-09-18T21:44:23.973

Implemented the answer outlined by Maras. It seems to do the job.

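#include <cstddef>
#include <iostream>
#include <vector>
#include <map>
#include <utility>
#include <string>
#include <set>

int main() {
    const std::size_t K = 3;  // number of unique IDs to return (unsigned, to match mySet.size())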

    std::vector<std::pair<double, std::string>> data {
        {0.4, "Cyrus"},
        {0.5, "James"},
        {0.9, "Bob"},
        {0.8, "Cyrus"},
        {0.7, "Vanessa"},
        {0.3, "Ariana"},
    };

    std::set<std::pair<double, std::string>> mySet;
    std::map<std::string, double> myMap;

    for (const auto& pair: data) {
        if (myMap.find( pair.second ) == myMap.end()) {
            // The ID is unique
            if (mySet.size() < K) {
                // The size of the set is less than the size of search candidates
                // Add the result to the map and the set
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            } else {
                // Check to see if the current score is larger than the worst performer in the set
                auto worstPairPtr = mySet.begin();

                if (pair.first > (*worstPairPtr).first) {
                    // The contender performed better than the worst in the set
                    // Remove the worst item from the set, and add the contender
                    // Remove the corresponding item from the map, and add the new contender
                    // Erase from the map first: erasing from the set invalidates worstPairPtr
                    myMap.erase(worstPairPtr->second);
                    mySet.erase(worstPairPtr);
                    mySet.insert(pair);
                    myMap[pair.second] = pair.first;
                }
            }

        } else {
            // The ID already exists
            // Compare the contender score to the score of the existing ID.
            // If the contender score is better, replace the existing item score with the new score
            // Remove the old item from the set
            if (pair.first > myMap[pair.second]) {
                mySet.erase({myMap[pair.second], pair.second});
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            }

        }
    }

    for (auto it = mySet.rbegin(); it != mySet.rend(); ++it) {
        std::cout << (*it).second << std::endl;
    }

}

The output is

Bob
Cyrus
Vanessa
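
As an alternative to the set-plus-map bookkeeping above (a different technique, not the one from Maras's answer), one could first reduce the data to the best score per ID and then select the top k with std::partial_sort. The sketch below assumes the scores are already available as (score, ID) pairs; the function name `topKUniqueIdsBySelection` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to k distinct IDs, best score first.
std::vector<std::string> topKUniqueIdsBySelection(
        const std::vector<std::pair<double, std::string>>& scored,
        std::size_t k) {
    // Pass 1: keep only the best score seen for each ID.
    std::unordered_map<std::string, double> best;
    for (const auto& entry : scored) {
        auto inserted = best.emplace(entry.second, entry.first);
        if (!inserted.second && entry.first > inserted.first->second) {
            inserted.first->second = entry.first;
        }
    }

    // Pass 2: sort only the k best (score, ID) pairs, highest first.
    std::vector<std::pair<double, std::string>> ranked;
    ranked.reserve(best.size());
    for (const auto& kv : best) {
        ranked.emplace_back(kv.second, kv.first);
    }
    const std::size_t count = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(),
                      ranked.begin() + static_cast<std::ptrdiff_t>(count),
                      ranked.end(),
                      std::greater<std::pair<double, std::string>>());

    std::vector<std::string> ids;
    ids.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        ids.push_back(ranked[i].second);
    }
    return ids;
}
```

The trade-off is memory: this keeps one entry per distinct ID rather than at most K entries, so it is probably only preferable when the number of enrolled people is much smaller than the number of templates.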
