[Resolved]
We have a few old services in our cluster and needed to update one of them so that it consumes and processes two extra messages. The messages are built in exactly the same way and consumed in exactly the same way.
After the service was running with more than one partition, we started seeing random FabricNotReadableExceptions. We spent a long time investigating the issue.
Identifying the problem:
1: We looked at a single partition.
2: We saw Node0 being Primary.
3: Node0 became a Secondary, and message processing was cancelled.
4: Node1 became a Primary and started consuming and processing messages.
5: For some reason Node0 was still receiving messages on the same partition and throwing exceptions when trying to access Reliable State.
We use the standard Service Fabric Remoting with custom partitioning. This setup has worked on multiple services so far, and we have never had an issue with it.