1

I am running a stratified sample on a dataset, and I keep the sample in a DataFrame called df. Every time I run a count on df (without re-running the stratified sampling), it gives me a different count, as if my data gets re-sampled each time I do an operation on df. I have the seed set to 12 and I use the Spark function sampleBy.

I am pretty new to Spark. Is this normal? How do I counteract this issue?

mblume
  • 243
  • 1
  • 3
  • 11

1 Answer

1

It is a bit hard to tell for sure without the code, but if you don't cache/persist your DataFrame anywhere, Spark will re-run everything up to the point where you call an action like .count(). So if you are sampling your data at some point with a random seed, the sampling will re-run as well, hence the different results.

You can use df = df.cache() or df = df.persist(), e.g. when you first load the data and right after the sampling, to have Spark create a sort of breakpoint and not re-run everything from the start.
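To see why an uncached pipeline can yield a different count on every action, here is a plain-Python sketch of the mechanism (a model, not actual Spark code): a seeded, stratified Bernoulli sample is re-drawn on every "action" over an input whose row order is not stable between runs, which stands in for Spark's non-deterministic partitioning; "caching" materializes the sample once so later actions reuse it.

```python
import random

def load_rows():
    """Simulates a source whose row order is not guaranteed to be
    stable between jobs (e.g. after a shuffle or a parallel read)."""
    rows = list(range(100))
    random.shuffle(rows)
    return rows

def sample_by(rows, fractions, seed):
    """Seeded stratified Bernoulli sample, keyed on row % 2.
    Deterministic for a given input order, but a different row order
    aligns the same random draws with different strata."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions[r % 2]]

fractions = {0: 0.2, 1: 0.8}

# Uncached: every "action" re-runs load + sample, like Spark does when
# the DataFrame is not persisted, so the count can change between calls.
uncached_counts = {len(sample_by(load_rows(), fractions, seed=12))
                   for _ in range(20)}

# "Cached": materialize the sample once and reuse it.
cached = sample_by(load_rows(), fractions, seed=12)
cached_counts = {len(cached) for _ in range(20)}  # always a single value

print(sorted(uncached_counts), sorted(cached_counts))
```

In PySpark the fix follows the same idea: persist right after the sampling step, e.g. `sampled = df.sampleBy("stratum", fractions, seed=12).cache()` (column name and fractions here are placeholders), so subsequent `.count()` calls reuse the materialized sample instead of re-drawing it.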

link to documentation

I hope this helps, good luck!

mkaran
  • 2,528
  • 20
  • 23
  • Thanks, that is really helpful! I am still wondering why I get different samples even though I use the same seed. I thought using a seed meant getting the same sample every time? – mblume Mar 14 '19 at 18:55
  • Unfortunately, I cannot tell without the code and sample data :), perhaps it could be that the data loading is done differently? – mkaran Mar 15 '19 at 07:14