Example
My distributed event-sourced system simulates houses being built and purchased over a period of time. For simplicity's sake, we will use the year as the distributed clock value (setting vector clocks aside for now).
Houses take 1 year to build in version 1 of the system, but take twice as long in version 2. This is a change in logic rather than structure.
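In code, the difference between versions might look something like this (a minimal sketch in Python; BUILD_YEARS and is_built are hypothetical names of my own, not part of any real system):

    # Only the build duration differs between versions, not the event schema.
    BUILD_YEARS = {"V1": 1, "V2": 2}

    def is_built(version, started_year, current_year):
        # A house exists once the version-specific build duration has
        # elapsed since its HouseBuildStarted event.
        return current_year >= started_year + BUILD_YEARS[version]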
To cope with this change, events recorded under version 1 must also be replayed using version 1 logic when rebuilding state/snapshots. When the version 2 portion of the log is reached, the application switches over to the version 2 logic and continues replaying the residual events. A valid snapshot is built.
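For a single node's log this switch-over works. A rough illustration (again a hedged sketch: the replay function, the event tuples, and the handler names are hypothetical framing, not a real API):

    BUILD_YEARS = {"V1": 1, "V2": 2}

    def replay(events):
        version = "V1"            # the log starts under version 1 logic
        started = {}              # house name -> year construction began
        owned = set()             # houses successfully purchased
        for node, year, kind, arg in events:
            if kind == "NodeUpgradedTo":
                version = arg     # switch logic for the residual events
            elif kind == "HouseBuildStarted":
                started[arg] = year
            elif kind == "HousePurchased":
                # 'HouseBuilt' is implicit: the house exists once the
                # version-specific build duration has elapsed
                if year >= started[arg] + BUILD_YEARS[version]:
                    owned.add(arg)
                else:
                    raise ValueError(f"{arg} does not exist yet in {year}")
        return owned

    # Node A's own log replays cleanly: Alpha is built under V1 rules.
    print(replay([
        ("A", 2000, "HouseBuildStarted", "Alpha"),
        ("A", 2001, "HousePurchased", "Alpha"),
        ("A", 2002, "NodeUpgradedTo", "V2"),
    ]))  # -> {'Alpha'}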
Problem
The nodes in my distributed system will be upgraded to version 2 at different times, creating a window during which multiple versions are running simultaneously. My current understanding is that this window can only be narrowed through techniques like feature switching, not removed completely (unless you sacrifice availability by bringing the entire system down for the upgrade).
This creates a problem when merging the event logs from the distributed nodes. The event versions bleed into each other, making it impossible to simply switch from version 1 to version 2 logic at a single point during the replay. For example:
Node  Clock  Event
... pre-merge ...
A     2000   HouseBuildStarted('Alpha')
A     2001   HousePurchased('Alpha')       <- 'HouseBuilt' event is implicit (inferred through logic)
A     2002   NodeUpgradedTo('V2')
B     2002   HouseBuildStarted('Bravo')
B     2003   HousePurchased('Bravo')
B     2004   NodeUpgradedTo('V2')
... post-merge ...
A     2000   HouseBuildStarted('Alpha')
A     2001   HousePurchased('Alpha')
B     2002   HouseBuildStarted('Bravo')
A     2002   NodeUpgradedTo('V2')
B     2003   HousePurchased('Bravo')       <- 'Bravo' does not exist yet (1 year too early)
B     2004   NodeUpgradedTo('V2')
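Running the replay sketch from above over the merged log shows the failure: the single global switch flips to V2 at node A's 2002 upgrade event, so node B's V1-era purchase is judged by V2 rules:

    merged = [
        ("A", 2000, "HouseBuildStarted", "Alpha"),
        ("A", 2001, "HousePurchased", "Alpha"),
        ("B", 2002, "HouseBuildStarted", "Bravo"),
        ("A", 2002, "NodeUpgradedTo", "V2"),
        ("B", 2003, "HousePurchased", "Bravo"),
        ("B", 2004, "NodeUpgradedTo", "V2"),
    ]
    replay(merged)  # -> ValueError: Bravo does not exist yet in 2003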
How is this usually handled in systems where taking all the nodes down is not acceptable?