Data structure that allows efficient equality checks for closely related sequences (common subsequences)

Question

I'm looking for a data structure that I can use for Snapshot (a sequence) in the following example:

val oldSnapshot = Snapshot(3, 2, 1)
val newSnapshot = 4 +: (oldSnapshot dropRight 1) // basically a shift

// newSnapshot is now passed to another function that also knows the old snapshot

// this check should be fast if true
if (newSnapshot.tail == (oldSnapshot dropRight 1))

Background: I need an immutable data structure that stores a snapshot of the last n items that appeared in a stream. When a new item appears in the stream, it is added and the oldest item is dropped, so the length is always at most n and the snapshots resemble a sliding window over the last n elements. In rare cases the stream can be interrupted and restarted. After a restart, the stream first emits at least n older elements before it continues to emit new "live" elements. However, some elements may be different after a restart, so I cannot be sure that a new snapshot of the recent history can be derived from an older snapshot just by appending new elements.

I further have a component that consumes a stream of these snapshots and does some incremental processing of the elements. It might for instance keep track of the sum of the elements. For a new snapshot it has to decide whether it was derived by appending one or a few elements to the end of the last known snapshot (and dropping the oldest elements) so it doesn't have to process all the old items again but can reuse some intermediate results.

By far the most common case is that the snapshot was shifted to include a single new element and the oldest was dropped. This case should be recognized really fast. If I kept track of the whole history of elements and never dropped the oldest, I could use a List and just compare the tail of the new list to the last known list. Internally, the comparison would check object identity, and in most cases this would be enough to see that the lists are identical.
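To make the `List` idea above concrete, here is a minimal sketch. It assumes the full history is kept with the newest element first, and it calls `eq` explicitly instead of relying on how `==` happens to be implemented for `List`. The object and value names are made up for illustration.

```scala
object ListSharingDemo {
  // Full history, newest element first; nothing is ever dropped.
  val oldHistory: List[Int] = List(3, 2, 1)

  // Prepending allocates one new cell whose tail is the old list itself.
  val newHistory: List[Int] = 4 :: oldHistory

  // Constant-time reference check: true because newHistory was built
  // from oldHistory by a single prepend.
  val derivedByOnePrepend: Boolean = newHistory.tail eq oldHistory

  // An equal but separately built list is a different object, so it
  // only matches under element-wise comparison.
  val rebuilt: List[Int] = List(3, 2, 1)
  val sameObject: Boolean = rebuilt eq oldHistory
  val sameElements: Boolean = rebuilt == oldHistory
}
```

The identity check only works here because the tail of the new list is the very same object as the old list. As soon as the oldest element is dropped with `dropRight`, a new list is built and that sharing is lost.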

I'm thinking about using a Vector or a similar data structure for the snapshots. I'm wondering whether such comparisons would also be guaranteed to be efficient in this sense, or whether there is a better-suited data structure that internally uses object identity checks on subcollections to quickly determine whether two instances are identical.

Does this help you: [Maximum Length for scala queue](https://stackoverflow.com/q/6918731/2359227)? — Tomer Shetah, Dec 20 '20 at 16:37
Not exactly. My problem is not that I can't limit the length of the sequence but that I need fast equality checks for sequences that are equal. — lex82, Dec 20 '20 at 19:38
`Vector` is probably not going to be a great choice as its structural sharing (which would enable object equality on subcollections) is very coarse (it's a 32-ary tree, after all). You probably are going to need a custom structure that's `List`-like but supports the "are the last _n_ elements of this the first _n_ elements of that" operation (i.e. one that stores direct references to the sublists with a common head at each node): it's not going to be memory-efficient at all, but for small _n_... *shrug*. — Levi Ramsey, Dec 20 '20 at 22:02
Note also that object identity is only going to help you if the producer and consumer are in the same JVM and the data is never serialized (or if the consumer is interning the snapshots)... IOW, I'd advise not putting the effort into this unless you're sure that the gain from sometimes being able to optimize is worth the definite cost of a scalability loss (or unless this is being done as a pure intellectual exercise) — Levi Ramsey, Dec 20 '20 at 22:10
@LeviRamsey thanks for your assessment. Then I'd rather not use `Vector` but build something myself. Shouldn't be too hard anyway and if it is it's not a bad exercise after all. — lex82, Dec 21 '20 at 08:10
I think it's ok to rely on object identity like it's done for `List` since it will always be in the same JVM. The more interesting question is whether my interface design of the consumer component makes sense. My idea is that I can keep the interface simple by just passing in the complete recent history snapshot, so it appears it doesn't need any internal state (and it really doesn't). However, internally it knows the data structure and can optimize by converting the snapshots to incremental changes again. I'll think about it and maybe post another question. — lex82, Dec 21 '20 at 08:19
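For illustration only, the consumer idea from the last comment could be sketched roughly as follows. This is one possible approach under stated assumptions, not something proposed in the thread: `Snapshot`, `push`, `isSingleShiftOf`, `start` and `SumConsumer` are all hypothetical names. Each snapshot carries a small identity token plus the token of the snapshot it was shifted from, so the common single-shift case is recognized with one reference comparison, and everything else falls back to reprocessing the whole window. Like the `List` approach, it only works inside a single JVM without serialization.

```scala
// Hypothetical snapshot: a bounded window (newest element first) plus identity tokens.
final class Snapshot private (
    val items: Vector[Int],
    val capacity: Int,
    private val parentId: AnyRef // token of the snapshot this one was shifted from
) {
  // Fresh token per instance; holds no reference to older snapshots.
  private val id: AnyRef = new AnyRef

  // Shift: prepend the new item and drop the oldest once the window is full.
  def push(item: Int): Snapshot =
    new Snapshot((item +: items).take(capacity), capacity, id)

  // Constant-time check: was this snapshot produced by `older.push(...)`?
  def isSingleShiftOf(older: Snapshot): Boolean = parentId eq older.id
}

object Snapshot {
  // Starts an unrelated snapshot, e.g. after the stream has been restarted.
  def start(capacity: Int, items: Int*): Snapshot = {
    require(capacity >= 1, "capacity must be positive")
    new Snapshot(items.toVector.take(capacity), capacity, new AnyRef)
  }
}

// Consumer that keeps a running sum and reuses it when it sees a single shift.
final class SumConsumer {
  private var last: Option[(Snapshot, Int)] = None

  def sum(s: Snapshot): Int = {
    val result = last match {
      case Some((prev, prevSum)) if s.isSingleShiftOf(prev) =>
        // Only subtract an element if the previous window was already full.
        val dropped = if (prev.items.length == prev.capacity) prev.items.last else 0
        prevSum + s.items.head - dropped
      case _ =>
        s.items.sum // unrelated snapshot or multi-step shift: recompute
    }
    last = Some((s, result))
    result
  }
}
```

A shift by several elements, or a snapshot rebuilt after a restart, is not recognized by this check and is simply recomputed, so correctness never depends on the fast path.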


0 Answers