
As I understand it, there are two types of dependencies: narrow and wide. But I don't understand how the dependency affects the child RDD. Is the child RDD only metadata that describes how to build new RDD blocks from the parent RDD? Or is the child RDD a self-sufficient set of data that was created from the parent RDD?

Speise

1 Answer


Yes, the child RDD is metadata that describes how to calculate its data from the parent RDD.

Consider org/apache/spark/rdd/MappedRDD.scala for example:

private[spark]
class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  // The child RDD simply reuses the parent's partitions; no data is copied.
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Work is deferred: when a partition is needed, the parent's iterator for that
  // partition is obtained and f is applied lazily, element by element.
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}

When you say rdd2 = rdd1.map(...), rdd2 will be such a MappedRDD. compute is only executed later, for example when you call rdd2.collect.
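
A minimal sketch of that laziness (assuming a spark-shell session, where sc is already defined):

val rdd1 = sc.parallelize(1 to 10)   // just lineage metadata, nothing is computed
val rdd2 = rdd1.map(_ * 2)           // builds the mapped RDD described above; still nothing is computed
val result = rdd2.collect()          // the action finally triggers compute on every partition
// result: Array(2, 4, 6, ..., 20)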

An RDD is always such metadata, even if it has no parents (for example sc.textFile(...)). The only case in which an RDD's data is stored on the nodes is when you mark it for caching with rdd.cache and then cause it to be computed.
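
A small sketch of the caching case (the input path is just a placeholder, and sc is again assumed to come from spark-shell):

val lines = sc.textFile("hdfs:///path/to/input.txt")  // placeholder path
val upper = lines.map(_.toUpperCase)
upper.cache()    // only marks the RDD for caching; nothing is stored yet
upper.count()    // the first action computes the partitions and keeps them in memory
upper.count()    // the second action reuses the cached partitions instead of re-reading the file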

Another similar situation is calling rdd.checkpoint. This function marks the RDD for checkpointing. The next time it is computed, it will be written to disk, and later access to the RDD will cause it to be read from disk instead of recalculated.
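
A corresponding checkpoint sketch (the checkpoint directory is an arbitrary example; it just has to be writable by the cluster):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // example directory
val rdd = sc.parallelize(1 to 100).map(_ + 1)
rdd.checkpoint()   // marks the RDD; it must be called before the first action on it
rdd.count()        // computing the RDD now also writes its partitions to the checkpoint directory
rdd.count()        // subsequent computations read the checkpointed data from disk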

The difference between cache and checkpoint is that a cached RDD still retains its dependencies. The cached data can be discarded under memory pressure, and may need to be recalculated in part or whole. This cannot happen with a checkpointed RDD, so the dependencies are discarded there.
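
One way to observe this difference is rdd.toDebugString, which prints the lineage (a rough sketch; the exact output format varies between Spark versions):

val cached = sc.parallelize(1 to 100).map(_ + 1).filter(_ % 2 == 0).cache()
cached.count()
println(cached.toDebugString)        // the full map/filter lineage is still listed

val checkpointed = sc.parallelize(1 to 100).map(_ + 1).filter(_ % 2 == 0)
checkpointed.checkpoint()            // assumes setCheckpointDir was already called, as above
checkpointed.count()
println(checkpointed.toDebugString)  // the lineage is truncated at the checkpoint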

Daniel Darabos
  • Thank you for the clear answer. But just to clarify: as I understand it, is the new transformed RDD only a new set of pointers to some (filtered) blocks of the old data? Or is the new transformed RDD a new copy of the old data? I'm interested in what physically happens to an RDD in the cluster during a transformation. – Speise Feb 14 '15 at 11:14
  • RDDs are lazy. None of the work is performed until an eager action (like `collect` or `reduce`) is executed. When an action is ultimately executed, the operations like `map` and `filter` are performed as a chain of iterators. The important point is that an RDD does not typically represent _data_, it represents _calculation_. – Daniel Darabos Feb 14 '15 at 13:58
  • OK, but let's imagine we have a Spark job with the following steps of calculation: (1) RDD -> (2) map -> (3) filter -> (4) collect. At the first stage we have the input RDD; at the second stage we transform this RDD with map (into key-value pairs). So what is the result of Spark at the third stage, during filtering? Will Spark just remove the unnecessary items from the RDD? Or will it create an absolutely new RDD with the necessary items and remove the previous one? And what happens to the bunch of items of the parent RDD that are unnecessary after filtering? – Speise Feb 14 '15 at 15:09
  • The RDDs are implemented with iterators. So the input file will be read, and the map function will be applied line by line, then the filter function. No more than one line is ever stored. (Well, lines will hang around in memory until the garbage collector cleans them up.) The exception is `collect`, which calls `iterator.toArray` to turn the results into an array and sends it back to the application. (See the sketch after these comments.) – Daniel Darabos Feb 14 '15 at 15:39
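
To make the iterator chaining from the last comments concrete, here is a plain-Scala sketch (no Spark involved; it just shows the same lazy-iterator idea within one partition):

val lines = Iterator("1", "2", "3", "4")   // stands in for one partition of the input file
val mapped = lines.map(_.toInt)            // nothing is computed yet
val filtered = mapped.filter(_ % 2 == 0)   // still nothing is computed
val result = filtered.toArray              // only now are elements pulled through map and filter, one at a time
// result: Array(2, 4)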