
How to oversample a dataframe in pyspark?

df.sample(fraction, seed)

But that only samples a fraction of the DataFrame; it can't oversample.

grizzthedj
Stevven
  • By oversample, do you mean increasing the number of samples compared to the original? If so, how do you plan to do that: by duplicating records or by applying an oversampling algorithm? – mayank agrawal Mar 13 '18 at 10:18
  • Define what you mean by "oversample". Try to provide an [mcve] if it's appropriate. – pault Mar 13 '18 at 14:17

1 Answer


You could over-sample by making use of the sample method as follows:

df.sample(withReplacement=True, fraction=total_percent_of_upsample, seed=seed)

sample(withReplacement, fraction, seed=None)

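Setting withReplacement=True samples with replacement, which allows fraction to be greater than 1.0 (for example, 2.0 to roughly double the row count). The resulting row count is approximate, not exact.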
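As a rough sketch of how that call could be wrapped, the helper below derives the fraction from a desired row count. The name oversample and its target_rows parameter are made up for illustration and are not part of PySpark; only count() and sample(withReplacement, fraction, seed) are DataFrame methods.

```python
def oversample(df, target_rows, seed=None):
    """Resample df with replacement to roughly target_rows rows.

    Hypothetical helper, not a PySpark API. It only assumes that df
    provides count() and sample(), as a PySpark DataFrame does.
    """
    current_rows = df.count()  # on a real DataFrame this triggers a Spark job
    if current_rows == 0:
        raise ValueError("cannot oversample an empty DataFrame")

    # With replacement, fraction is the expected number of times each row
    # is drawn, so a value above 1.0 grows the DataFrame.
    fraction = target_rows / current_rows

    return df.sample(withReplacement=True, fraction=fraction, seed=seed)
```

With a running SparkSession named spark, oversample(spark.range(100), 300, seed=42).count() should come out near 300, though not necessarily exactly 300. Because rows are drawn at random, some original rows may not appear in the result at all; if every original row has to be kept, one option is to union the original DataFrame with a sampled one instead.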

Tshilidzi Mudau