I am running a stratified sample on a dataset and keeping the result in a DataFrame called df. Every time I run a count on df (without re-running the stratified sampling), I get a different count, as if the data were re-sampled each time I perform an operation on df. I have the seed set to 12 and I am using the Spark function sampleBy.
I am pretty new to Spark. Is this normal? How do I counteract this issue?