
My data structure needs three operations:

  • insert an element at a random place in the ordering
  • find and remove smallest element
  • (rarely) delete an element via some key returned at insert time

The existing code is a single-linked list and does a linear search to find an insert point. O(n).

Finding and removing the smallest element is trivial: pull off and dispose of the head link. O(1).

The insert returns a pointer to the link, and the delete call takes that pointer. Were it a doubly-linked list the link could simply be unlinked: O(1). Alas, the list is singly-linked, so it is searched for the node with that address, which is O(n). This search is expensive, but it does allow detection of an attempted double-delete in some cases: attempted deletion of a node simply not on the list won't find it, so it does nothing except generate a warning in the log. (On the other hand, the nodes are stored in a LIFO memory pool, so they are likely to be reused, and an accidental re-deletion of a node may well remove some other node instead.)

OK, with a heap, the insert is O(log n). Delete of minimum is O(log n). Both simple.

But what of delete-by-key? If I keep the heap in an array, it's basically a linear search, O(n). I move the elements around in the heap to keep the heap property (bubbling down and up as needed), so I can't just use the node's address. Plus, unless you accept a fixed maximum size, you need to reallocate the array which typically moves it.

I'm thinking maybe the heap could be an array of POINTERS to the actual nodes, which live elsewhere. Each node would have its array index in it, and as I move pointers-to-nodes around in the heap, I'd update each node with its new array index. Thus a request to delete a node could supply me with the node; I use the node's stored index into the heap and delete that pointer, so it's now O(log n). It just seems far more complicated.

Given the extra overhead of allocating non-moving nodes separately, and keeping their array index field updated, sounds like it might be more than some very occasional number of linear searches. OTOH, an advantage of keeping nodes separate from the array heap is that it's faster to swap pointers than whole nodes (which in my case may be 32 bytes or more).
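The scheme above can be sketched as a struct plus one helper that keeps the back-index honest. This is only an illustration of the idea, not working code; all names (`Node`, `heap`, `nheap`, `put`) are made up, and it assumes a fixed-capacity min-first heap keyed on an int:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the pointer-heap idea: nodes never move, only pointers to
 * them do, and each placement also records the node's heap index. */
typedef struct Node {
    int    key;         /* priority: smallest first */
    size_t iheap;       /* index of our pointer in heap[] */
    char   payload[24]; /* stands in for the ~32-byte payload */
} Node;

#define MAXHEAP 1024
static Node  *heap[MAXHEAP];
static size_t nheap;

/* Every placement of a pointer goes through here, so the back-index
 * can never go stale.  Delete-by-key then starts at node->iheap: O(1)
 * to find the slot, O(log n) to restore the heap property. */
static void put(Node *p, size_t i) {
    heap[i] = p;
    p->iheap = i;
}
```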

Any simpler ideas?

Swiss Frank
  • I think you can use just an ordinary tree structure, each node containing 3 pointers: (up, rson, lson). Since such nodes do not move in memory, you can use a direct pointer as the "some key returned at insert time" to delete a node. If you care about tree balance, just implement an rb-tree or so. – olegarch Apr 05 '20 at 03:45
  • Certainly an idea. I'd have to care about tree balance, so would need a full red black tree. I have code for one that's faster than STL both for -g and -O2 so could use that, modifying the nodes to carry the actual payload. But it's more like a 1000-line solution instead of a 20-line solution. – Swiss Frank Apr 05 '20 at 04:16
  • FWIW: I agree that shuffling pointers-to-nodes rather than shuffling 32 (or more) byte nodes is (for any machine I can think of) a win... and the cost of writing a new heap-index to each node is small compared to that. (And, of course, for the node being moved up or down the heap, you only need to write the heap-index when it reaches its final position.) I have done just this for a heap of timers, to allow a timer to be cancelled and also to allow the timer to be updated -- shuffling it up or down the heap as required. Extending the heap is also less painful. – Chris Hall Apr 05 '20 at 11:27
  • If your keys are reasonably small, then you can store the heap in an array (as is traditionally done), and store an extra array *P* mapping keys to heap positions (thus P[k] is an integer containing the position in the heap/array where key *k* is found). When you swap elements up and down in the heap, you also swap the corresponding values in *P*. This is commonly done in [Dijkstra's algorithm](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm) in order to implement the decrease-key operation. – Cătălin Frâncu Apr 06 '20 at 13:09
  • I'd suggest that you do the simplest thing first, and not worry about optimizing delete. You said yourself that delete isn't something you do very often ("some very occasional number of linear searches"). If what you build performs to your satisfaction, then you're done. Why work harder than you have to? – Jim Mischel Apr 06 '20 at 14:29

2 Answers


OK. You can keep in memory:

  • Your data structures, containing payload and priority, dynamically allocated. The address of a structure (a pointer) is the removal key.
  • A binary heap containing pointers to these data structures. Heapify the elements according to the priorities stored in the structs, reached through the pointers.

Thus, your algorithms:

  • Insert: create the data struct, place its pointer into the heap, rebalance the heap: O(log n).

  • Delete minimum: take the root element from the heap, free its struct, remove the root from the heap, rebalance the heap: O(log n).

  • Delete random element: from the pointer, get the priority; search the heap for this element by priority: O(log n). Free the struct, delete the pointer from the heap. Rebalance the heap: again O(log n).
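The insert step of this scheme, where the returned pointer doubles as the removal key, could look roughly like this. A hedged sketch only: the names (`Item`, `pq`, `npq`, `pq_insert`) are invented, and it assumes a fixed-capacity min-first heap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Dynamically allocated structs; a binary heap of pointers ordered by
 * the priority stored inside each struct. */
typedef struct Item {
    int prio;     /* smallest first; payload would follow */
} Item;

#define CAP 1024
static Item  *pq[CAP];
static size_t npq;

/* Insert: allocate the struct, sift its pointer up into place,
 * O(log n).  The returned pointer is the removal key. */
static Item *pq_insert(int prio) {
    Item *it = malloc(sizeof *it);
    it->prio = prio;
    size_t i = npq++;
    while (i > 0 && pq[(i - 1) / 2]->prio > prio) {
        pq[i] = pq[(i - 1) / 2];   /* pull the parent pointer down */
        i = (i - 1) / 2;
    }
    pq[i] = it;
    return it;
}
```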

olegarch
  • Close, but searching the heap by element is impossible (or at least, O(N)). Consider this: 1) half the elements in the tree are on the bottom level. 2) an extremely low-priority key could be any of those bottom-level items. Even for a mid-level priority, you still won't know which branch it is in. – Swiss Frank Apr 11 '20 at 09:12
  • Yes, if several elements have the same priority ID, you need to iterate over all elements with that priority until you find the right one by pointer comparison. However, you can make all elements distinct by building a priority ID that contains a serial number, for example: uint64_t PriorityID = (your_priority << 32) | counter++; as a result, all priority IDs will be distinct within your heap. – olegarch Apr 11 '20 at 14:28
  • Even if they're unique, how do you know which branch they're on? Say my priority is low number first. Say my array contains pointers with priority [ 1 3 2 4 6 5 7 8 9 10 11 12 13 14 15]. I'm looking for the 13. How would I know that to get there from the root, I need to go right to value 2, left to value 5, then right to 13? That 13 could be a grandchild of either of the root's children, and a child of either of the root's children's children, no? – Swiss Frank Apr 11 '20 at 17:50

The data nodes are allocated upon queuing, deallocated upon dequeue, and not moved. When you queue data, your return value is this node's address, though to keep the API clean the return value's type is opaque.

The heap is a heap of pointers to these nodes.

The data nodes have an unsigned int which holds the current array offset of the pointer to them.

When we move a pointer (due to a heap operation such as insert or delete) we update its node's index. For instance, if we determine our key is higher priority than our parent's, we move our parent to our current position like this:

apnode[i] = apnode[iParent];   /* move the parent's pointer down one level */
apnode[i]->iOffset = i;        /* keep the moved node's back-index in sync */

Inserts of data with any key, and deletes of the node with the highest-priority key, work normally. Both are O(log n).

The new operation, deleting a node that is not yet highest-priority, involves casting the opaque key back to a pointer, dereferencing it to get the corresponding pointer's current offset, then deleting the element at that offset. Thus, getting to the pointer to delete is a very fast O(1). From that point, deleting it normally is the usual O(log n).

The extra overhead to support this random delete amounts to setting that offset. It's only one line of code, but it does substantially increase the set of cache lines touched, compared to a pointer heap without this feature. On the other hand, the heap is probably substantially faster for being a pointer heap than one whose array elements are the nodes themselves.
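Fleshed out, the scheme might look like the sketch below: the sift helpers route every pointer placement through one function that maintains the back-index, and delete-by-key moves the last pointer into the hole and re-heapifies from there. All names (`TNode`, `apnode`, `heap_insert`, `heap_delete`) are invented for this sketch, and it assumes a fixed-capacity min-first heap:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct TNode {
    int    key;      /* priority: smallest first */
    size_t iOffset;  /* current index of our pointer in apnode[] */
} TNode;

#define NMAX 1024
static TNode *apnode[NMAX];
static size_t nNodes;

/* Every placement updates the back-index, so it can never go stale. */
static void put(TNode *p, size_t i) { apnode[i] = p; p->iOffset = i; }

static void sift_up(size_t i) {
    TNode *p = apnode[i];
    while (i > 0 && apnode[(i - 1) / 2]->key > p->key) {
        put(apnode[(i - 1) / 2], i);   /* pull the parent down */
        i = (i - 1) / 2;
    }
    put(p, i);
}

static void sift_down(size_t i) {
    TNode *p = apnode[i];
    for (;;) {
        size_t c = 2 * i + 1;                  /* left child */
        if (c >= nNodes) break;
        if (c + 1 < nNodes && apnode[c + 1]->key < apnode[c]->key)
            c++;                               /* pick the smaller child */
        if (apnode[c]->key >= p->key) break;
        put(apnode[c], i);                     /* pull the child up */
        i = c;
    }
    put(p, i);
}

/* Insert: O(log n).  The returned pointer is the opaque removal key. */
static TNode *heap_insert(int key) {
    TNode *p = malloc(sizeof *p);
    p->key = key;
    put(p, nNodes++);
    sift_up(nNodes - 1);
    return p;
}

/* Delete by key: O(1) to find the slot, O(log n) to re-heapify.
 * Also serves as delete-min when called on apnode[0]. */
static void heap_delete(TNode *p) {
    size_t i = p->iOffset;
    TNode *last = apnode[--nNodes];
    if (last != p) {          /* move the last pointer into the hole */
        put(last, i);
        sift_up(i);           /* the moved node may go either way */
        sift_down(last->iOffset);
    }
    free(p);
}
```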


Almost totally off-topic but very useful for anyone reading this:

All textbooks seem to present heap insertion as follows: add your item to the end of the heap, then do swaps to heapify it up to where it belongs. Each of those swaps is three data moves, though. It's more efficient to instead consider a pointer to the end of the heap as the candidate destination (CD) for the new data, but not write it there yet.

Then, compare the new data's key with the parent's. If the new data is lower or equal priority, write it at the CD and you're done. Otherwise, simply copy the parent to the CD, and the parent's address becomes the new CD. Repeat. This cuts the actual data moves by two-thirds.
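The hole technique above, sketched for a plain int-keyed min-heap (names `h`, `nh`, `heap_push` are illustrative, and it assumes a fixed-capacity array):

```c
#include <assert.h>
#include <stddef.h>

#define HMAX 1024
static int    h[HMAX];
static size_t nh;

/* Insert via a candidate destination (CD): parents are copied down
 * into the hole, one move per level instead of a three-move swap, and
 * the new key is written exactly once, at its final position. */
static void heap_push(int key) {   /* min-heap: smallest at h[0] */
    size_t cd = nh++;              /* end of heap is the first CD */
    while (cd > 0 && h[(cd - 1) / 2] > key) {
        h[cd] = h[(cd - 1) / 2];   /* copy the parent down: one move */
        cd = (cd - 1) / 2;         /* parent's slot is the new CD */
    }
    h[cd] = key;                   /* single write at the final slot */
}
```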

Swiss Frank