0

I know reservoir sampling can be applied in parallel, but Spark seems to use other sampling methods that I am not familiar with. Could someone describe them briefly?

According to @Tristan's answer, I guess the purpose of not using reservoir sampling is to keep the classes balanced. But I went through the source code and found nothing about labels.

cstur4

1 Answer

-1

I know that stratified sampling exists.
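To make the idea concrete, here is a minimal plain-Python sketch of stratified sampling: each key (stratum) gets its own sampling fraction, and every record is kept independently with the probability assigned to its key. This is only an illustration, not Spark's source code; the function name `stratified_sample`, the toy data, and the fractions are all made up for the example.

```python
import random
from collections import Counter


def stratified_sample(pairs, fractions, seed=None):
    """Keep each (key, value) pair independently with probability fractions[key].

    Keys missing from `fractions` are treated as fraction 0.0 (never kept).
    This is a per-key Bernoulli sketch, so the sample size per key is only
    approximately fraction * count, not exact.
    """
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


# Hypothetical data: a large class "a" and a small class "b".
data = [("a", i) for i in range(1000)] + [("b", i) for i in range(100)]

# Sample the rare class more heavily than the common one.
sample = stratified_sample(data, {"a": 0.1, "b": 0.5}, seed=42)

# Count how many records of each key ended up in the sample.
print(Counter(k for k, _ in sample))
```

As far as I know, Spark's `sampleByKey` on pair RDDs takes a similar key-to-fraction map, but check the documentation for the exact behaviour.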

RoyaumeIX
  • You may also check out this link: https://databricks.com/blog/2014/08/27/statistics-functionality-in-spark.html – RoyaumeIX May 25 '16 at 06:04