I am trying to identify all sources of non-determinism in Spark. I understand that non-determinism can come from user-provided functions, e.g. a map(f) where f involves random numbers. I am instead looking for operations that can introduce non-determinism themselves, either at the level of transformations/actions or at a lower level, e.g. shuffling.
1 Answer
Off the top of my head:
- Operations that require shuffling (or network traffic in general) may output values in a non-deterministic order. This includes obvious cases like groupBy* or join. A less obvious example is the order of ties after sorting.
- Operations that depend on changing data sources or mutable global state.
- Side effects executed inside transformations, including accumulator updates.
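
To make the first point concrete, here is a rough sketch in plain Python. It does not use Spark at all; it only simulates map-side partitions reaching a reducer in different orders. The `partitions` data and the helper names `shuffle_and_group` and `sort_by_key` are made up for illustration.

```python
import itertools

# Hypothetical data: three "map-side" partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("a", 5)],
]


def shuffle_and_group(arrival_order):
    """Group values by key, reading partitions in the given arrival order."""
    grouped = {}
    for idx in arrival_order:
        for key, value in partitions[idx]:
            grouped.setdefault(key, []).append(value)
    return grouped


def sort_by_key(arrival_order):
    """Sort records by key only; ties keep whatever order they arrived in."""
    records = [rec for idx in arrival_order for rec in partitions[idx]]
    return sorted(records, key=lambda rec: rec[0])  # Python's sort is stable


# Collect the grouped values for key "a" under every possible arrival order.
orders_seen_for_a = {
    tuple(shuffle_and_group(order)["a"])
    for order in itertools.permutations(range(len(partitions)))
}

# Same records, same keys, but the order within a group or among ties differs.
forward = sort_by_key((0, 1, 2))
backward = sort_by_key((2, 1, 0))
```

The set of values per key is always the same, so order-insensitive aggregations (sum of integers, count, max) are unaffected. Anything that depends on position within a group, such as taking the first element or zipping with an index, can change from run to run.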

zero323
- Can you give an example of a side effect inside a transformation? – savx2 Dec 09 '15 at 21:50
- Communicating with an external system, writing to a file, updating "global" executor state. – zero323 Dec 09 '15 at 21:52
- Don't forget pretty much any operation that reads a timestamp or an environment variable, since these can vary from node to node. – Roberto Congiu Dec 10 '15 at 00:52
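
To illustrate the side-effect examples mentioned in the comments, here is a small plain-Python simulation (again, no Spark involved) of a task that writes to an external system and is re-run after a failure. `ExternalLog`, `run_task` and `run_with_retry` are hypothetical names; the retry logic is a simplified assumption about how a scheduler re-executes a failed task, not Spark's actual implementation.

```python
class ExternalLog:
    """Hypothetical stand-in for an external system (file, database, queue)."""

    def __init__(self):
        self.writes = []

    def write(self, record):
        self.writes.append(record)


def run_task(partition, log, fail_at=None):
    """Double each record, writing it to the log as a side effect.

    If fail_at is set, raise before processing the record at that index
    to mimic a task that dies part-way through.
    """
    out = []
    for i, x in enumerate(partition):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated task failure")
        log.write(x)  # side effect: not undone if the task later fails
        out.append(x * 2)
    return out


def run_with_retry(partition, log, fail_first_attempt):
    """Run the task once; on failure, re-run it from the start."""
    try:
        return run_task(partition, log, fail_at=2 if fail_first_attempt else None)
    except RuntimeError:
        return run_task(partition, log)


data = [10, 20, 30, 40]

clean_log = ExternalLog()
clean_out = run_with_retry(data, clean_log, fail_first_attempt=False)

retried_log = ExternalLog()
retried_out = run_with_retry(data, retried_log, fail_first_attempt=True)
```

Both runs return the same transformed data, but the external log ends up with 4 writes in the clean run and 6 in the retried one. As far as I know, the Spark programming guide gives a similar warning for accumulators: updates made inside transformations may be applied more than once if tasks or stages are re-executed.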