
I have the need to store many data flows consisting of something like:

struct Flow {
    source: Address,
    destination: Address,
    last_seq_num_sent: u32,
    last_seq_num_rcvd: u32,
    last_seq_num_ackd: u32
}

I need to query by last_seq_num_rcvd. I can guarantee (with off-screen magic) the uniqueness of this field among all flows.

The flows may run over unreliable connections, so some sequence numbers may get skipped due to network packet loss. I account for this by using a window, which is also guaranteed to be unique across its entire stretch. The flows progress at rates independent of each other, but each one can renumber its sequence numbers before a collision occurs.

So the goal is to perform a range query against the flows to find any flow with a last_seq_num_rcvd within a WINDOW_SIZE constant's distance of some given next sequence number.

I gather the BTreeMap is appropriate here for its range query ability.

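use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included};

const WINDOW_SIZE: u32 = 10;
struct FlowValue { /* All original fields, minus last_seq_num_rcvd which now acts as key */ }

let mut flows: BTreeMap<u32, FlowValue> = BTreeMap::new();

let query = 42;
// `range` takes a single bounds argument, so the two bounds go in a tuple
for (k, v) in flows.range((Excluded(query), Included(query + WINDOW_SIZE))) {
    // This is how I would query for a flow
}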

But now my key is something that changes often. It seems like there's no efficient way to update it in-place; it requires full deletion and reinsertion (under incremented key), which sounds like an expensive operation.
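
To make the cost concrete, here is a minimal sketch of the delete-and-reinsert step. The `rekey` helper and the trimmed-down `FlowValue` are hypothetical names used only for illustration; both `remove` and `insert` are O(log n) on a `BTreeMap`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `FlowValue`; the `Address` fields are omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct FlowValue {
    pub last_seq_num_sent: u32,
    pub last_seq_num_ackd: u32,
}

/// Moves the entry stored under `old_key` to `new_key`.
/// Returns `false` (and changes nothing) if `old_key` is absent.
/// Assumes `new_key` is not already in use, as the uniqueness guarantee implies;
/// otherwise `insert` would overwrite the existing entry.
pub fn rekey(flows: &mut BTreeMap<u32, FlowValue>, old_key: u32, new_key: u32) -> bool {
    match flows.remove(&old_key) {
        Some(value) => {
            flows.insert(new_key, value);
            true
        }
        None => false,
    }
}
```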

Is the BTreeMap method too expensive? Is there an alternative data structure that isn't? Or could I overload the BTreeMap to actually perform an efficient in-place increment of an integer key?

armani
  • If I've understood correctly, the objective is to find the next available sequence number, i.e. the maximal `last_seq_num_rcvd` plus `WINDOW_SIZE`? In which case, surely one only need track that number somewhere independently from the collection of flows? – eggyal Feb 24 '22 at 03:33
  • @eggyal No, the objective is to identify the `FlowValue` that matches any new incoming packets by passing its sequence number in a range check. – armani Feb 24 '22 at 06:43
  • Is there a big amount of churn? I mean do you have a lot of creations of new `Flow`s and deletions of old ones, or do you have a (mostly) fixed number of `Flow`s? In the latter case, the best solution is probably a sorted `Vec` since updating a `Flow` should not change its place by much and searching can be done efficiently with [`binary_search`](https://doc.rust-lang.org/1.54.0/std/vec/struct.Vec.html#method.binary_search). Of course as with all performance-related questions, **you should measure under realistic loads** to confirm. – Jmb Feb 24 '22 at 07:57
  • @Jmb It churns minimally as you assume. I can see a way to range check using `binary_search` but that gives me an index for a `Vec` of sequence numbers. How would I associate the index to the respective flow data? – armani Feb 24 '22 at 08:57
  • Upon further reading, it seems like [binary_search_by_key](https://doc.rust-lang.org/1.54.0/std/vec/struct.Vec.html#method.binary_search_by_key) might get me there. – armani Feb 24 '22 at 09:14
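
The sorted `Vec` idea from the comments could look roughly like the sketch below. It keeps whole flows in the vector, so the index found by the search is already the index of the flow data. It uses `partition_point` rather than `binary_search_by_key` because only the two ends of the window are needed. The function names and the trimmed-down `Flow` (no `Address` fields) are assumptions made for illustration.

```rust
/// Hypothetical flow record; the `Address` fields are omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
pub struct Flow {
    pub last_seq_num_sent: u32,
    pub last_seq_num_rcvd: u32,
    pub last_seq_num_ackd: u32,
}

pub const WINDOW_SIZE: u32 = 10;

/// Returns the flows whose `last_seq_num_rcvd` lies in
/// `(query, query + WINDOW_SIZE]`, mirroring the range in the question.
/// `flows` must already be sorted by `last_seq_num_rcvd`.
pub fn flows_in_window(flows: &[Flow], query: u32) -> &[Flow] {
    let upper = query.saturating_add(WINDOW_SIZE);
    // Index of the first flow with key > query.
    let start = flows.partition_point(|f| f.last_seq_num_rcvd <= query);
    // Index of the first flow with key > upper.
    let end = flows.partition_point(|f| f.last_seq_num_rcvd <= upper);
    &flows[start..end]
}

/// Sets the key of `flows[index]` to `new_seq`, then swaps the element with
/// its neighbours until the slice is sorted again. Returns the new index.
/// With small increments this usually takes zero or one swap.
pub fn update_rcvd(flows: &mut [Flow], mut index: usize, new_seq: u32) -> usize {
    flows[index].last_seq_num_rcvd = new_seq;
    while index + 1 < flows.len()
        && flows[index].last_seq_num_rcvd > flows[index + 1].last_seq_num_rcvd
    {
        flows.swap(index, index + 1);
        index += 1;
    }
    while index > 0 && flows[index - 1].last_seq_num_rcvd > flows[index].last_seq_num_rcvd {
        flows.swap(index - 1, index);
        index -= 1;
    }
    index
}
```

As noted in the comments, whether this beats a `BTreeMap` depends on the churn and should be measured under realistic load.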

1 Answer


You're right that a B-Tree map is a little expensive for this application.

Since the window size is constant, a faster implementation would be to partition the sequence numbers into buckets of size about WINDOW_SIZE/2. Then just put the flows into a hash table according to their rcvd bucket.

To find flows for a particular packet, then, you only need to look up the 3 buckets that could possibly contain matching flows, and test each flow in the buckets. This will be faster than a B-Tree lookup.

On update, the situation is even better, because you only need to update the hash table when an entry changes buckets, and that only happens about once every WINDOW_SIZE/2 packets.
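
A minimal sketch of this bucketing scheme, under a few assumptions: `FlowTable`, `bucket_of`, and `BUCKET_SIZE` are hypothetical names, the `Address` fields are left out, the matching range `(query, query + WINDOW_SIZE]` is copied from the question, and `WINDOW_SIZE` is at least 2 so the bucket size is nonzero.

```rust
use std::collections::HashMap;

pub const WINDOW_SIZE: u32 = 10;
/// Bucket width of about half the window (must be nonzero).
pub const BUCKET_SIZE: u32 = WINDOW_SIZE / 2;

/// Hypothetical flow record; the `Address` fields are omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
pub struct Flow {
    pub last_seq_num_sent: u32,
    pub last_seq_num_rcvd: u32,
    pub last_seq_num_ackd: u32,
}

fn bucket_of(seq: u32) -> u32 {
    seq / BUCKET_SIZE
}

#[derive(Default)]
pub struct FlowTable {
    buckets: HashMap<u32, Vec<Flow>>,
}

impl FlowTable {
    pub fn insert(&mut self, flow: Flow) {
        self.buckets
            .entry(bucket_of(flow.last_seq_num_rcvd))
            .or_default()
            .push(flow);
    }

    /// Finds a flow with `last_seq_num_rcvd` in `(query, query + WINDOW_SIZE]`
    /// by scanning only the buckets that overlap that range.
    pub fn find(&self, query: u32) -> Option<&Flow> {
        let upper = query.saturating_add(WINDOW_SIZE);
        (bucket_of(query)..=bucket_of(upper))
            .filter_map(|b| self.buckets.get(&b))
            .flatten()
            .find(|f| f.last_seq_num_rcvd > query && f.last_seq_num_rcvd <= upper)
    }

    /// Changes a flow's key from `old_seq` to `new_seq`. The hash map itself
    /// is only modified when the flow crosses a bucket boundary.
    pub fn update(&mut self, old_seq: u32, new_seq: u32) -> bool {
        let (old_b, new_b) = (bucket_of(old_seq), bucket_of(new_seq));
        let bucket = match self.buckets.get_mut(&old_b) {
            Some(bucket) => bucket,
            None => return false,
        };
        let pos = match bucket.iter().position(|f| f.last_seq_num_rcvd == old_seq) {
            Some(pos) => pos,
            None => return false,
        };
        if old_b == new_b {
            // Same bucket: overwrite the field in place.
            bucket[pos].last_seq_num_rcvd = new_seq;
        } else {
            // Different bucket: move the flow and drop the old bucket if empty.
            let mut flow = bucket.swap_remove(pos);
            let now_empty = bucket.is_empty();
            if now_empty {
                self.buckets.remove(&old_b);
            }
            flow.last_seq_num_rcvd = new_seq;
            self.buckets.entry(new_b).or_default().push(flow);
        }
        true
    }
}
```

With `WINDOW_SIZE = 10` the range covers exactly three buckets of width 5, so `find` does three hash lookups plus a short linear scan of whatever flows live in those buckets.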

Matt Timmermans