What is an RDD in Spark?

Question

Definition says:

RDD is immutable distributed collection of objects

I don't quite understand what that means. Is it like data (partitioned objects) stored on a hard disk? If so, how can RDDs hold user-defined classes (in Java, Scala, or Python)?

From this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch03.html It mentions:

Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program

I am really confused about RDDs in general and in relation to Spark and Hadoop.

Can someone please help?

score 46 · Accepted Answer · edited Aug 04 '17 at 14:44

An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc.

The formal definition is:

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

If you want the full details on what an RDD is, read one of the core Spark academic papers, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

edited Aug 04 '17 at 14:44

Willian Fuks

answered Dec 23 '15 at 10:14

Ewan Leith

When the data is already distributed in an RDD, what does partitioning mean? Can't "distributed" also mean "partitioned"? – kittu Dec 23 '15 at 11:03

@kittu The data is distributed in partitions. You should audit the course [Introduction to Big Data with Apache Spark](https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x). A more visible way to see how the data is partitioned is to use the `glom` method of `RDD`. – Alberto Bonsanto Dec 23 '15 at 11:15

I think it would be more correct to say that an RDD is a representation of a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of instructions telling how to retrieve data and what to do with it. An RDD is a "lazy" representation of your data. It is similar to a SQL execution plan. – YoYo Aug 19 '16 at 07:56

tharindu_DG · Answer 2 · 2018-11-14T01:53:24.973

19

An RDD is a logical reference to a dataset that is partitioned across many server machines in the cluster. RDDs are immutable and recover themselves in case of failure.

The dataset could be data loaded externally by the user. It could be a JSON file, a CSV file, or a text file with no specific data structure.

UPDATE: Here is the paper that describes RDD internals:

Hope this helps.

edited Nov 14 '18 at 01:53

answered Dec 23 '15 at 10:05

tharindu_DG

@tharindu_DG I don't get the *with no data ordering* part. `RDD` and dataset in general may significantly rely on the element order. – Odomontois Dec 23 '15 at 11:46
@Odomontois: I meant to describe the data structure. CSV files are semi-structured and normal text files are unstructured. I corrected the answer. Sorry about my English, and thanks for pointing that out. – tharindu_DG Dec 23 '15 at 12:07
@kittu: In my experience, you don't need to know everything about RDDs to learn Spark; knowing the basic features of an RDD is enough. Once you implement something with the Spark API, you'll understand. – tharindu_DG Dec 23 '15 at 12:34
@tharindu_DG Thanks, that is what I am looking for. I need a basic understanding so I can get my hands dirty. One quick question: Spark + Cassandra is data analytics, right? So does that mean I can build graphs/charts with it, or am I thinking in the wrong direction? – kittu Dec 23 '15 at 12:37
@kittu: Yes. Spark supports several data input sources, and Cassandra is one such source. – tharindu_DG Dec 23 '15 at 14:07
@VolkanGüven. Fixed. Thanks – tharindu_DG Nov 14 '18 at 01:53

Mahesh · Answer 3 · 2017-01-04T08:00:59.560

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

RDDs have the following properties:

Immutability and partitioning: An RDD is composed of a collection of records that are partitioned. A partition is the basic unit of parallelism in an RDD; each partition is one logical division of the data, is immutable, and is created through transformations on existing partitions. Immutability helps achieve consistency in computations. Users can define their own partitioning criteria, based on the keys on which they want to join multiple datasets, if needed.
Coarse-grained operations: Coarse-grained operations are operations that are applied to all elements in a dataset. For example, a map, filter, or groupBy operation is performed on all elements in a partition of an RDD.
Fault tolerance: Since RDDs are created through a set of transformations, Spark logs those transformations rather than the actual data. The graph of transformations that produces an RDD is called its lineage graph.

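For example (someFunction is a placeholder for any user-supplied function):

    firstRDD = sc.textFile("hdfs://...")        # read lines from storage
    secondRDD = firstRDD.filter(someFunction)   # transformation: recorded, not yet run
    thirdRDD = secondRDD.map(someFunction)      # transformation: recorded, not yet run
    result = thirdRDD.count()                   # action: triggers the computation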

If we lose a partition of an RDD, we can replay the transformations in its lineage on that partition to reproduce the same result, rather than replicating data across multiple nodes. This characteristic is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and thus achieves faster computations.

Lazy evaluation: Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations. In the above example, the RDDs are evaluated only when the count() action is invoked.
Persistence: Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage or on disk).

These properties of RDDs make them useful for fast computations.
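
To make the lineage and lazy-evaluation points concrete, here is a minimal plain-Python sketch. It is a toy model, not Spark's implementation or API: `ToyRDD`, `load_lines`, and the `reads` list are names made up for illustration. It only shows the idea that transformations are recorded and nothing is read until an action such as `count()` runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a data source plus a recorded list of transformations."""

    def __init__(self, source, lineage=()):
        self.source = source            # callable that returns the input records
        self.lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, fn):
        # A transformation returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        # Same idea: extend the lineage, leave the original object untouched.
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage over the source as one lazy pipeline.
        records = iter(self.source())
        for kind, fn in self.lineage:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    def count(self):
        # An action: only now is the source read and the pipeline run.
        return sum(1 for _ in self._compute())


reads = []  # records every time the source is actually touched


def load_lines():
    reads.append("read")
    return ["spark", "", "hadoop", "rdd"]


first = ToyRDD(load_lines)
second = first.filter(lambda line: line != "")
third = second.map(str.upper)
reads_before_action = len(reads)   # still no reads at this point
result = third.count()             # the action triggers exactly one read
```

Note that `first` keeps an empty lineage after `filter` and `map` are called on it, which is the immutability property in miniature: each transformation produces a new object instead of changing the old one.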

pgirard · Answer 4 · 2016-10-14T20:15:56.120

Resilient Distributed Dataset (RDD) is the way Spark represents data. The data can come from various sources:

Text file
CSV file
JSON file
Database (via a JDBC driver)

RDD in relation to Spark

Spark is simply an implementation of the RDD abstraction.

RDD in relation to Hadoop

The power of Hadoop resides in the fact that it lets users write parallel computations without having to worry about work distribution and fault tolerance. However, Hadoop is inefficient for applications that reuse intermediate results. For example, iterative machine learning and graph algorithms, such as PageRank, K-means clustering, and logistic regression, reuse intermediate results.

RDDs allow intermediate results to be stored in RAM. Hadoop would have to write them to an external stable storage system, which generates disk I/O and serialization. With RDDs, Spark is up to 20x faster than Hadoop for iterative applications.
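
The cost of not reusing intermediate results can be sketched in plain Python. This is only an illustration under simplifying assumptions: `parse`, `raw`, and the two `iterate_*` functions are made-up names, and the "expensive" step is just a counter rather than real disk I/O.

```python
parse_calls = 0


def parse(raw_text):
    """Pretend this is an expensive load-and-parse step."""
    global parse_calls
    parse_calls += 1
    return [float(x) for x in raw_text.split(",")]


raw = "1,2,3,4"


def iterate_without_reuse(steps):
    # Every iteration goes back to the raw input, the way a chain of
    # independent jobs reading from stable storage would.
    total = 0.0
    for _ in range(steps):
        total += sum(parse(raw))
    return total


def iterate_with_reuse(steps):
    # The parsed data is kept in memory and shared by all iterations.
    cached = parse(raw)
    total = 0.0
    for _ in range(steps):
        total += sum(cached)
    return total


parse_calls = 0
without = iterate_without_reuse(5)
calls_without = parse_calls        # one parse per iteration

parse_calls = 0
with_reuse = iterate_with_reuse(5)
calls_with = parse_calls           # a single parse, reused five times
```

Both versions compute the same total; the difference is how often the input has to be re-read, which is where iterative algorithms lose time when every step starts from storage.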

Further implementation details about Spark

Coarse-Grained transformations

The transformations applied to an RDD are coarse-grained. This means that the operations on an RDD are applied to the whole dataset, not to its individual elements. Therefore, operations like map, filter, group, and reduce are allowed, but operations like set(i) and get(i) are not.

The opposite of coarse-grained is fine-grained. A database is an example of a fine-grained storage system.
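
A small plain-Python analogy of the distinction, with a tuple standing in for an immutable dataset and a list standing in for a fine-grained mutable store. This is not Spark code, and the variable names are only illustrative.

```python
records = (3, 1, 4, 1, 5)   # immutable tuple standing in for a dataset

# Coarse-grained: one function is applied to every record and the
# result is a new dataset; the original is left untouched.
doubled = tuple(x * 2 for x in records)
multiples_of_four = tuple(x for x in doubled if x % 4 == 0)

# Fine-grained: update one element in place, as a mutable store allows.
table = list(records)
table[2] = 99

# The immutable dataset rejects the same fine-grained update.
try:
    records[2] = 99
    fine_grained_allowed = True
except TypeError:
    fine_grained_allowed = False
```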

Fault Tolerant

RDDs are fault tolerant: the system can continue working properly when one of its components fails.

The fault tolerance of Spark is strongly linked to its coarse-grained nature. The only way to implement fault tolerance in a fine-grained storage system is to replicate its data or log updates across machines. In a coarse-grained system like Spark, however, only the transformations are logged. If a partition of an RDD is lost, the RDD has enough information to recompute it quickly.
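
Here is a rough plain-Python sketch of that recovery idea. It is a toy model under stated assumptions, not how Spark is implemented: `source_partitions`, `lineage`, and `compute_partition` are hypothetical names, and "losing a node" is simulated by deleting a dictionary entry.

```python
# Stable input, already split into partitions (stand-in for files in storage).
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}

# The lineage: the ordered transformations that define the derived dataset.
lineage = [
    ("filter", lambda x: x % 2 == 1),
    ("map", lambda x: x * 10),
]


def compute_partition(index):
    """Rebuild one derived partition from its source partition and the lineage."""
    records = source_partitions[index]
    for kind, fn in lineage:
        if kind == "map":
            records = [fn(x) for x in records]
        else:
            records = [x for x in records if fn(x)]
    return records


# Materialize all derived partitions in (simulated) executor memory.
in_memory = {i: compute_partition(i) for i in source_partitions}
snapshot = dict(in_memory)

# Simulate losing the node that held partition 1.
del in_memory[1]

# Recovery replays the lineage for the missing partition only.
missing = [i for i in source_partitions if i not in in_memory]
for i in missing:
    in_memory[i] = compute_partition(i)
```

Only partition 1 is recomputed; the other partitions are untouched, and no second copy of the derived data was ever kept.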

Data storage

The RDD is "distributed" (separated) into partitions. Each partition can be held in memory or on the disk of a machine. When Spark wants to launch a task on a partition, it sends the task to the machine containing the partition. This is known as "locality-aware scheduling".
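
As a rough illustration in plain Python (not Spark's scheduler): the splitting scheme, the `location` table, and the node names below are all assumptions made up for the example. The point is only that each task goes to where its partition already lives, and just the small per-partition results are combined.

```python
def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous chunks (a simple toy scheme)."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


records = list(range(10))
partitions = split_into_partitions(records, 3)

# Hypothetical placement: which machine currently holds each partition.
location = {0: "node-a", 1: "node-b", 2: "node-a"}


def schedule(partition_index):
    """Pick the machine that already holds the partition."""
    return location[partition_index]


# Each task runs where its data lives; only the partial sums are combined.
partial_sums = {i: (schedule(i), sum(p)) for i, p in enumerate(partitions)}
total = sum(s for _, s in partial_sums.values())
```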

Sources: research papers about Spark: http://spark.apache.org/research.html

These include the paper suggested by Ewan Leith.

score 6 · Answer 5 · answered Nov 08 '16 at 00:20

RDD = Resilient Distributed Dataset

Resilient (Dictionary meaning) = (of a substance or object) able to recoil or spring back into shape after bending, stretching, or being compressed

Learning Spark (O'Reilly) explains the name this way: The ability to always recompute an RDD is actually why RDDs are called “resilient.” When a machine holding RDD data fails, Spark uses this ability to recompute the missing partitions, transparent to the user.

This means the data can always be made available again. Spark can also run without Hadoop, so this resilience does not depend on data replication. One of the best characteristics of Hadoop 2.0 is 'High Availability', with the help of a passive standby NameNode. RDDs give Spark a comparable guarantee for its datasets, through recomputation.

A given RDD (data) can span various nodes in a Spark cluster (as in a Hadoop-based cluster).

If any node crashes, Spark can recompute the RDD and load the data on some other node, so the data remains available. Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel (http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)

Saketh · Answer 6 · 2015-12-23T11:08:28.793

3

To compare an RDD with a Scala collection, here are a few points:

An RDD offers much the same operations, but runs on a cluster
An RDD is lazy in nature, whereas Scala collections are strict by default
An RDD is always immutable, i.e., you cannot change the state of the data in the collection
An RDD is self-recovering, i.e., fault-tolerant
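
The strict-versus-lazy point can be sketched in Python, used here only as an analogy: a list comprehension plays the role of a strict collection and a generator expression plays the role of a lazy one. `square` and `calls` are illustrative names.

```python
calls = []


def square(x):
    calls.append(x)   # record every time real work happens
    return x * x


numbers = [1, 2, 3]

# Strict: the work happens as soon as the collection is defined.
strict = [square(x) for x in numbers]
calls_after_strict = len(calls)

calls.clear()

# Lazy: defining the pipeline does no work yet.
lazy = (square(x) for x in numbers)
calls_after_lazy_definition = len(calls)

# Consuming the result is what triggers the computation.
total = sum(lazy)
calls_after_consumption = len(calls)
```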

edited Dec 23 '15 at 11:08

answered Dec 23 '15 at 10:25

Saketh

What I mentioned there is the default nature of Scala collections, though we can make one lazy by declaring it lazy, like `lazy val l= List(10, 20);` – Saketh Dec 23 '15 at 11:05
`Stream` is already lazy in that sense, for example; also, every `.view` is lazy in a sense very similar to `RDD` – Odomontois Dec 23 '15 at 11:32

score 1 · Answer 7 · answered Jun 27 '18 at 21:36

RDDs (Resilient Distributed Datasets) are an abstraction for representing data. Formally, they are a read-only, partitioned collection of records that provides a convenient API.

RDDs provide a performant solution for processing large datasets on clusters, compared with frameworks such as MapReduce, by addressing some key issues:

data is kept in memory to reduce disk I/O; this is particularly relevant for iterative computations, which then do not have to persist intermediate data to disk
fault tolerance (resilience) is obtained not by replicating data but by keeping track of all transformations applied to the initial dataset (the lineage). This way, in case of failure, lost data can always be recomputed from its lineage, and avoiding data replication also reduces storage overhead
lazy evaluation, i.e. computations are carried out only when they're needed

RDDs have two main limitations:

they're immutable (read-only)
they only allow coarse-grained transformations (i.e. operations that apply to the entire dataset)

One nice conceptual advantage of RDDs is that they pack together data and code, making it easier to reuse data pipelines.

Sources: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, An Architecture for Fast and General Data Processing on Large Clusters

score 1 · Answer 8 · answered Jul 13 '18 at 11:02

An RDD is a way of representing data in Spark. The source of the data can be JSON, CSV, a text file, or some other source. An RDD is fault tolerant: its data is stored in distributed form across multiple nodes, so if a node fails the lost data can be recovered and the data stays available. However, the RDD API is low-level, and RDD code is often slower and harder to write than the equivalent DataFrame or Dataset code, so for most workloads it has been superseded by DataFrames and Datasets.

score 0 · Answer 9 · answered Jan 21 '20 at 13:29

RDD stands for Resilient Distributed Dataset. It is a core part of Spark and is Spark's low-level API. DataFrames and Datasets are built on top of RDDs. An RDD is nothing but row-level data that sits on n executors. RDDs are immutable, which means you cannot change an RDD, but you can create a new RDD from it using transformations; actions compute a result from an RDD rather than creating a new one.

score 0 · Answer 10 · answered Aug 20 '21 at 15:20

Resilient Distributed Datasets (RDDs)

Resilient: If an operation is lost while running on a node in Spark, the dataset can be reconstituted from its history (lineage).

Distributed: Data in RDDs is divided into one or many partitions and distributed as in-memory collections of objects across worker nodes in the cluster.

Dataset: RDDs are datasets that consist of records; records are uniquely identifiable data collections within a dataset.
