I have a big data file loaded in Spark, but I want to work on only a small portion of it to run my analysis. Is there a way to do that? I tried repartition, but it causes a lot of shuffling. Is there a good way to process only a small chunk of a big file loaded in Spark?

mrsrinivas
jack
  • Use a filter operation to select the desired portion; after filtering you will have only a small portion to work on. You can also use limit. – Akash Sethi Mar 15 '17 at 05:07

2 Answers

In short

You can use the sample() or randomSplit() transformations on an RDD.

sample()

/**
  * Return a sampled subset of this RDD.
  *
  * @param withReplacement can elements be sampled multiple times
  * @param fraction expected size of the sample as a fraction of this RDD's size
  *  without replacement: probability that each element is chosen; fraction must be [0, 1]
  *  with replacement: expected number of times each element is chosen; fraction must be 
  *  greater than or equal to 0
  * @param seed seed for the random number generator
  *
  * @note This is NOT guaranteed to provide exactly the fraction of the count
  * of the given [[RDD]].
  */

  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

Example:

val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)
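
For context, a minimal self-contained sketch of the same call might look like the following. It assumes a local-mode SparkContext and uses a generated range of numbers as a hypothetical stand-in for the big file; the object name `SampleSketch` and its `run` helper are made up for illustration. Since `sample()` works partition by partition, it should not trigger the shuffle that repartition does.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleSketch {
  // Returns the number of elements kept by the sample.
  def run(): Long = {
    // Local-mode context, for illustration only.
    val conf = new SparkConf().setAppName("sample-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the large dataset.
      val rdd = sc.parallelize(1 to 100000)

      // Keep each element with probability 0.2; a fixed seed makes the run repeatable.
      val small = rdd.sample(withReplacement = false, fraction = 0.2, seed = 2L)

      // The size is only approximately 20% of the input, as the doc comment notes.
      small.count()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"sampled ${run()} elements")
}
```

If the sampled RDD is reused across several actions, calling `cache()` on it avoids re-reading and re-sampling the source each time.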

randomSplit()

/**
  * Randomly splits this RDD with the provided weights.
  *
  * @param weights weights for splits, will be normalized if they don't sum to 1
  * @param seed random seed
  *
  * @return split RDDs in an array
  */

def randomSplit(
   weights: Array[Double],
   seed: Long = Utils.random.nextLong): Array[RDD[T]]

Example:

val rddParts = rdd.randomSplit(Array(0.8, 0.2)) // splits the RDD into roughly an 80-20 ratio
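
As a rough end-to-end sketch of `randomSplit()`, the snippet below splits an RDD and keeps the smaller part for analysis. The local-mode context, the generated input, and the names `RandomSplitSketch`, `run`, `large`, and `small` are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RandomSplitSketch {
  // Returns the sizes of the (large, small) parts.
  def run(): (Long, Long) = {
    // Local-mode context, for illustration only.
    val conf = new SparkConf().setAppName("random-split-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the large dataset.
      val rdd = sc.parallelize(1 to 100000)

      // Weights are normalized, so Array(8.0, 2.0) would give the same split.
      val Array(large, small) = rdd.randomSplit(Array(0.8, 0.2), seed = 42L)

      // Run the analysis on `small`; the sizes are approximate, not exact.
      (large.count(), small.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (largeCount, smallCount) = run()
    println(s"large part: $largeCount, small part: $smallCount")
  }
}
```

Unlike a single `sample()` call, the parts here are disjoint, which is handy if the rest of the data is needed later, for example as a hold-out set.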
mrsrinivas

You can use either of the following RDD APIs:

  1. yourRDD.filter(on some condition)
  2. yourRDD.sample(<with replacement>,<fraction of data>,<random seed>)

Ex: yourRDD.sample(false, 0.3, System.currentTimeMillis())

If you want a random fraction of the data, I suggest the second method. If you need the part of the data that satisfies some condition, use the first one.
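
To make the difference concrete, here is a small sketch that applies both options to the same RDD. The local-mode context, the generated input, the even-number predicate, and the names `FilterOrSampleSketch` and `run` are all assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterOrSampleSketch {
  // Returns the sizes of the (filtered, sampled) subsets.
  def run(): (Long, Long) = {
    // Local-mode context, for illustration only.
    val conf = new SparkConf().setAppName("filter-or-sample-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical stand-in for the large dataset.
      val yourRDD = sc.parallelize(1 to 100000)

      // Option 1: keep only the elements that satisfy a condition.
      val byCondition = yourRDD.filter(_ % 2 == 0)

      // Option 2: keep a random subset of roughly 30% of the elements.
      val byChance = yourRDD.sample(false, 0.3, 7L)

      (byCondition.count(), byChance.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (filtered, sampled) = run()
    println(s"filtered: $filtered, sampled: $sampled")
  }
}
```

Both calls are lazy transformations that work within each partition, so neither should cause the shuffling that repartition does.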

Ravi Teja