6

I have two lists, L1 and L2, of data containing multiple elements, each unique, of an abstract data type (ie: structs). Each of the two lists:

  • May contain between zero and one-hundred (inclusive) elements.
  • Contains no duplicate elements (each element is unique).
  • May or may not contain elements in the other list (ie: L1 and L2 might be identical, or contain completely different elements).
  • Is not sorted.
  • At the lowest level, is stored within a std::vector<myStruct> container.

What I am typically expecting is that periodically, a new element is added to L2, or an element is subtracted/removed from it. I am trying to detect the differences in the two lists as efficiently (ie: with minimal comparisons) as possible:

  • If an entry is not present in L2 and is present in L1, carry out one operation: Handle_Missing_Element().
  • If an entry is present in L2 and not present in L1, carry out another operation: Handle_New_Element().

Once the above checks are carried out, L1 is set to be equal to L2, and at some time in the future, L2 is checked again.

How could I go about finding out the differences between the two lists? There are two approaches I can think of:

  1. Compare both lists via every possible combination of elements. This has O(n^2) execution complexity (horrible).

for (const auto& e2 : L2) {
  bool found = false;
  for (const auto& e1 : L1) {
    if (e1 == e2) {  // needs an operator== for myStruct
      // Found duplicate entry
      found = true;
      break;
    }
  }
}
  2. Sort the lists, and compare the two lists element-wise until I find a difference. This seems like it would be in near-linear time. The problem is that I would need the lists to be sorted. It would be impractical to manually sort the underlying vector after each addition/removal for the list. It would only be reasonable to do this if it were somehow possible to force vector::push_back() to automatically insert elements such that insertions preserve the sorting of the list.

Is there a straightforward way to accomplish this efficiently in C++? I've found similar such problems, but I need to do more than just find the intersection of two sets, or do such a test with just a set of integers, where sum-related tricks can be used, as I need to carry out different operations for "new" vs "missing" elements.

Thank you.

Cloud
  • 3
    Difficult to use `std::vector` in C. Suggest dropping `C` tag. – chux - Reinstate Monica Jul 28 '15 at 03:35
  • 1
    So, your lists are not really linked lists (as in `std::list`), but are actually arrays (as in `std::vector`)? – AnT stands with Russia Jul 28 '15 at 03:36
  • Do you have a comparison function for elements? (I mean `operator<`, not just `operator==`.) – Beta Jul 28 '15 at 03:38
  • @stgatilov Correct, L1 is constant. – Cloud Jul 28 '15 at 03:40
  • @AnT Correct. It's an array, though I could change it to a `std::list` rather than a `std::vector`. – Cloud Jul 28 '15 at 03:41
  • 1
    @Beta I do not have comparison functions. It's just a `struct` rather than a fully-defined `class` at this time. – Cloud Jul 28 '15 at 03:41
  • And *when* do you carry out the operations? Every time a common element is removed or a new element added? Or at some arbitrary future time, after some elements have come and gone? – Beta Jul 28 '15 at 03:49
  • @Beta I carry out the operations at periodic intervals, so there's no way of knowing in advance how many additions/removals have been carried out. – Cloud Jul 28 '15 at 03:50
  • You could build a `skip list` out of the linked list. I think most good searches depend on some kind of sorted representation. A `skip list` is a fancy linked list with `O(log n)` expected search. – Matt Jul 28 '15 at 04:43

4 Answers

4

Can you create a hash value for your list items? If so, just compute the hash and check the hash table for the other list. This is quick, does not require sorting, and prevents your "every possible combination" problem. If you're using C++ and the STL you could use a map container to hold each list (std::unordered_map is the hash-based one; std::map is an ordered tree).

  • Create a hash for each item in L1, and use a map to associate it with your list item.
  • Create a similar map for L2, and as each L2 hash is created check to see if it's in the L1 map.
  • When a new element is added to L2, calculate its hash value and check to see if it's in the L1 hash map (using map.find() if using STL maps). If not then carry out your Handle_New_Element() function.
  • When an element is removed from the L2 list and its hash is in the L1 hash map (that is, it is present in L1 but no longer in L2), then carry out your Handle_Missing_Element() function.
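
The steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the question does not show myStruct, so the MyStruct type and its `id` field (used as the unique key) are hypothetical, as are the names `Diff` and `diff_by_hash`. It uses std::unordered_set, which compares the full key whenever two hashes match, so a hash collision should not produce a false match.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical element type: the `id` field used as the unique key
// is an assumption, not something given in the question.
struct MyStruct {
    std::string id;
};

struct Diff {
    std::vector<MyStruct> missing; // present in L1 but not in L2
    std::vector<MyStruct> added;   // present in L2 but not in L1
};

// Builds one hash set of keys per list, then probes each list
// against the other list's set. Expected cost is O(|L1| + |L2|).
Diff diff_by_hash(const std::vector<MyStruct>& L1,
                  const std::vector<MyStruct>& L2) {
    std::unordered_set<std::string> keys1, keys2;
    for (const auto& e : L1) keys1.insert(e.id);
    for (const auto& e : L2) keys2.insert(e.id);

    Diff d;
    for (const auto& e : L1)
        if (keys2.count(e.id) == 0) d.missing.push_back(e);
    for (const auto& e : L2)
        if (keys1.count(e.id) == 0) d.added.push_back(e);
    return d;
}
```

At each periodic check you would call Handle_Missing_Element() for every entry in `missing` and Handle_New_Element() for every entry in `added`, then copy L2 into L1.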
Thane Plummer
  • Nice idea for determining if the lists are different or not. But it seems OP also has a requirement of finding which elements are missing. – kaylum Jul 28 '15 at 03:41
  • Thank you. This does let me detect the difference between the two lists, but I need to be able to find missing vs new elements, and make a distinction between the two. – Cloud Jul 28 '15 at 03:49
  • Actually I think you can detect missing elements... stand by and I'll update my answer. – Thane Plummer Jul 28 '15 at 03:57
  • It seems that your solution may be incorrect in case of hash collision. It may be not very important in practice, if the hashes are large, though. – stgatilov Jul 28 '15 at 04:37
  • 1
    Let hash table `X` and hash table `Y` store arbitrary types with a representation of magnitude. Let sorted sequence `Z` represent the differences between hash tables `X`, `Y`. That is, when you insert into `Y` also check `X` and if they're different, store the difference in `Z`. – Matt Jul 28 '15 at 04:51
  • @stgatilov You are correct that collisions are a problem, so a prudent choice of the hash algorithm is required. You can also deal with collisions by writing a compare function and/or storing a CRC or checksum in the data structure so that you double-check every matching hash. – Thane Plummer Jul 28 '15 at 13:38
4

It would be impractical to manually sort the underlying vector after each addition/removal for the list. It would only be reasonable to do this if it were somehow possible to force vector::push_back() to automatically insert elements such that insertions preserve the sorting of the list.

What you're talking about here is an ordered insert. There are functions in <algorithm> that allow you to do this. Rather than using std::vector::push_back you would use std::vector::insert, and call std::lower_bound, which does a binary search for the first element not less than a given value.

auto insert_pos = std::lower_bound( L2.begin(), L2.end(), value );
if( insert_pos == L2.end() || *insert_pos != value )
{
    L2.insert( insert_pos, value );
}

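The binary search for the insertion point is O(log N), although std::vector::insert still has to shift the elements after that point, so each insertion is O(N) in the worst case. Even so, if you are doing fewer than N insertions between your periodic checks, it ought to be an improvement.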

The zipping operation might look something like this:

auto it1 = L1.begin();
auto it2 = L2.begin();

while( it1 != L1.end() && it2 != L2.end() )
{
    if( *it1 < *it2 ) {
        Handle_Missing( *it1++ );
    } else if( *it2 < *it1 ) {
        Handle_New( *it2++ );
    } else {
        it1++;
        it2++;
    }
}

while( it1 != L1.end() ) Handle_Missing( *it1++ );
while( it2 != L2.end() ) Handle_New( *it2++ );
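
Both snippets above rely on the element type supporting `<` and equality comparison. Here is a minimal self-contained sketch of what that might look like. The `Element` type, its `key` field, and the helper names `insert_sorted` and `erase_sorted` are all hypothetical, since the question does not show the struct.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical element type; using `key` as the ordering key is an assumption.
struct Element {
    std::string key;
};

// Ordering and equality on the key, as needed by std::lower_bound
// and by the duplicate check below.
inline bool operator<(const Element& a, const Element& b) { return a.key < b.key; }
inline bool operator==(const Element& a, const Element& b) { return a.key == b.key; }

// Inserts `value` at its sorted position. Returns false if it was already present.
bool insert_sorted(std::vector<Element>& v, const Element& value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    if (pos != v.end() && *pos == value) return false;
    v.insert(pos, value);
    return true;
}

// Removes `value` if present. Returns true if something was removed.
bool erase_sorted(std::vector<Element>& v, const Element& value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    if (pos == v.end() || !(*pos == value)) return false;
    v.erase(pos);
    return true;
}
```

As long as every change to the vector goes through helpers like these, it stays sorted and duplicate-free, which is what the merge-style walk over L1 and L2 requires.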
paddy
  • 2
    Inserting in the middle of a vector takes **O(N)** time. – stgatilov Jul 28 '15 at 04:32
  • 1
    In practice, vector inserts are faster than lists for anything up to a fairly obscene-sized contained type. I think it would have helped if the OP had identified *why* they are maintaining these two lists. I would have suggested providing the operations in a queue and just rattling them off. That, or storing everything in a tree. – paddy Jul 28 '15 at 04:35
  • 1
    @paddy I'm keeping track of newly connected/disconnected microphones for an audio/DSP system, and need to either tell the underlying software to allocate buffers for a new mic, or to clean up and free the buffers for a mic no longer connected to the system. The only way I can uniquely identify mics is via a hard-coded UUID built into the hardware. At this time, I have no disconnect/connect event handling capabilities, and must rely on polling all connecting audio devices (potential mics). – Cloud Jul 28 '15 at 05:06
  • If it's a small struct, then you will be pleasantly surprised by the efficiency of inserting into a vector. Benchmarks that I've seen indicate that random inserts into a vector are better than a list if the contained type is less than about 1000 bytes. When you say UUID, I assume you're talking about something more on the order of 32 bytes. – paddy Jul 28 '15 at 05:17
  • 1
    @dogbert By the sound of it, you can probably just maintain L1 as a sorted vector (using ordered inserts), and get rid of L2 completely. When you enumerate the connected device UUIDs, you can binary search each one in `L1` (using `std::binary_search`) and then push it into an 'added' or 'deleted' vector. After enumeration, go through those vectors, call the appropriate handlers and update `L1`. – paddy Jul 28 '15 at 05:23
  • 1
    @paddy: if it is the sorted vector you suggested, wouldn't it be more natural to use std::set as the list contains no duplicated element. std::set is internally sorted by the comparison object. – simon Jul 28 '15 at 06:09
  • 1
    @simon That's correct, but the memory layout is different. Using a vector improves cache locality. It's hard to know how frequently the OP is polling. For all we know, this might be happening hundreds of times per second. Also, vector could be given a modest amount of reserve such that allocations are never required under normal operating conditions. This could be perceived as premature optimisation of course. Using a set could be a perfectly valid solution. – paddy Jul 28 '15 at 06:25
3

A container that automatically sorts itself on inserts is std::set. Insertions will be O(log n), and comparing the two sets will be O(n). Since all your elements are unique you don't need std::multiset.
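
As a rough illustration, assuming the elements can be ordered by some key: the `Item` type, its `key` field, the `ByKey` comparator and the `only_in_first` helper below are hypothetical names. std::set_difference from <algorithm> does the linear comparison of two sorted ranges.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical element type; ordering by `key` is an assumption.
struct Item {
    std::string key;
};

// Comparison object used both by the set and by std::set_difference.
struct ByKey {
    bool operator()(const Item& a, const Item& b) const { return a.key < b.key; }
};

using ItemSet = std::set<Item, ByKey>;

// Returns the elements of `a` that do not appear in `b`, in sorted order.
std::vector<Item> only_in_first(const ItemSet& a, const ItemSet& b) {
    std::vector<Item> out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out), ByKey{});
    return out;
}
```

Calling `only_in_first(L1, L2)` would give the candidates for Handle_Missing_Element(), and `only_in_first(L2, L1)` the candidates for Handle_New_Element().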

Mark Ransom
2

For each element of both arrays, maintain the number of times it occurs in the other array. You can store these numbers in separate arrays with the same indexing, or in the structs you use.

When an element x is inserted into L2, you have to check it for equality with all the elements of L1. On each equality with y, increment counters of both elements x and y.

When an element x is removed from L2, you have to again compare it with all the elements of L1. On each equality with y from L1, decrement counter of y. Counter of x does not matter, since it is removed.

When you want to find non-duplicate elements, you can simply iterate over both arrays. The elements with zero counters are the ones you need.

In total, you need O(|L1|) additional operations per insert and remove, and O(|L1| + |L2|) operations per duplication search. The latter can be reduced to the number of sought-for non-duplicate elements, if you additionally maintain lists of all elements with zero counter.

EDIT: Ooops, it seems that each counter is always either 0 or 1 because of uniqueness in each list.

EDIT2: As Thane Plummer has written, you can additionally use a hash table. If you create a hash table for L1, then you can do all the comparisons in insert and remove in O(1). BTW since your L1 is constant, you can even create a perfect hash table for it to make things faster.
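
A rough sketch of the flag-plus-hash-table idea, done in one pass at check time rather than incrementally. The `Entry` type, its `key` field, and the names `Changes` and `mark_and_sweep` are assumptions made for illustration only. Since each counter is only ever 0 or 1, a vector of bools stands in for the counters.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical element type; `key` as the unique identifier is an assumption.
struct Entry {
    std::string key;
};

struct Changes {
    std::vector<std::size_t> missing; // indices into L1 with no match in L2
    std::vector<std::size_t> added;   // indices into L2 with no match in L1
};

Changes mark_and_sweep(const std::vector<Entry>& L1,
                       const std::vector<Entry>& L2) {
    // Hash table for L1: key -> index in L1.
    std::unordered_map<std::string, std::size_t> index1;
    for (std::size_t i = 0; i < L1.size(); ++i) index1[L1[i].key] = i;

    // One 0/1 "counter" per element of L1.
    std::vector<bool> matched(L1.size(), false);

    Changes c;
    for (std::size_t j = 0; j < L2.size(); ++j) {
        auto it = index1.find(L2[j].key);
        if (it != index1.end()) matched[it->second] = true; // seen in both lists
        else c.added.push_back(j);                          // only in L2
    }
    for (std::size_t i = 0; i < L1.size(); ++i)
        if (!matched[i]) c.missing.push_back(i);            // only in L1
    return c;
}
```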

stgatilov