1

I am running a stratified sample on a dataset, and I keep the sample in a DataFrame called df. Every time I run a count on df (without re-running the stratified sampling), it gives me a different count, as if my data gets re-sampled each time I do an operation on df. I have the seed set to 12 and I use the Spark function sampleBy.

I am pretty new to Spark. Is this normal? How do I counteract this issue?

mblume
  • 243
  • 1
  • 3
  • 11

1 Answer

1

It is a bit hard to tell for sure without the code, but if you don't cache/persist your DataFrame anywhere, Spark will re-run everything up to the point where you call an action like .count(). So if you are sampling your data at some point with a random seed, the sampling will re-run as well, hence the different results.

You can use df = df.cache() or df = df.persist(), e.g. when you first load the data and right after the sampling, to have Spark create a sort of breakpoint and not re-run everything from the start.
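To see why an uncached pipeline can yield a different count on every action, here is a plain-Python sketch of the mechanism (a model, not actual Spark code): a seeded, stratified Bernoulli sample is re-drawn on every "action" over an input whose row order is not stable between runs, which stands in for Spark's non-deterministic partitioning; "caching" materializes the sample once so later actions reuse it.

```python
import random

def load_rows():
    """Simulates a source whose row order is not guaranteed to be
    stable between jobs (e.g. after a shuffle or a parallel read)."""
    rows = list(range(100))
    random.shuffle(rows)
    return rows

def sample_by(rows, fractions, seed):
    """Seeded stratified Bernoulli sample, keyed on row % 2.
    Deterministic for a given input order, but a different row order
    aligns the same random draws with different strata."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions[r % 2]]

fractions = {0: 0.2, 1: 0.8}

# Uncached: every "action" re-runs load + sample, like Spark does when
# the DataFrame is not persisted, so the count can change between calls.
uncached_counts = {len(sample_by(load_rows(), fractions, seed=12))
                   for _ in range(20)}

# "Cached": materialize the sample once and reuse it.
cached = sample_by(load_rows(), fractions, seed=12)
cached_counts = {len(cached) for _ in range(20)}  # always a single value

print(sorted(uncached_counts), sorted(cached_counts))
```

In PySpark the fix follows the same idea: persist right after the sampling step, e.g. `sampled = df.sampleBy("stratum", fractions, seed=12).cache()` (column name and fractions here are placeholders), so subsequent `.count()` calls reuse the materialized sample instead of re-drawing it.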

link to documentation

I hope this helps, good luck!

mkaran
  • 2,528
  • 20
  • 23
  • Thanks, that is really helpful! I am still wondering why I get different samples even though I use the same seed. I thought using a seed meant getting the same sample every time? – mblume Mar 14 '19 at 18:55
  • Unfortunately, I cannot tell without the code and sample data :), perhaps it could be that the data loading is done differently? – mkaran Mar 15 '19 at 07:14