Questions tagged [rdd]

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the core of Spark (which I often refer to as "Spark Core").

Warning: Please note that the RDD API is a very low-level construct and is not recommended for use in modern versions of Apache Spark. Please use the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute partitions that go missing or are damaged due to node failures.

Distributed, with data residing on multiple nodes in a cluster.

Dataset, i.e. a collection of partitioned data holding primitive values or composite values, e.g. tuples or other objects that represent the records of the data you work with.
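A minimal PySpark sketch of these three properties (the variable names and sample data are illustrative, not taken from the tag wiki): the RDD is partitioned across the cluster, each transformation produces a new read-only RDD, and the lineage that Spark would use to recompute lost partitions can be inspected with toDebugString.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Built from a source (here a local collection) and transformed through bulk
# operations; every step is recorded in the lineage graph.
numbers = sc.parallelize(range(10), numSlices=4)   # distributed across 4 partitions
squares = numbers.map(lambda x: x * x)             # read-only: map returns a new RDD
evens = squares.filter(lambda x: x % 2 == 0)

print(evens.collect())         # [0, 4, 16, 36, 64]
print(evens.toDebugString())   # the lineage used to recompute missing partitions
```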

For more information:

  1. Mastering-Apache-Spark: RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
39
votes
4 answers

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

I want to share this particular Apache Spark with Python solution because documentation for it is quite poor. I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY. Here is what the sample data looks like: >>>…
NYCeyes
  • 5,215
  • 6
  • 57
  • 64
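For the question above on per-key averages, a common pattern is to carry a (sum, count) pair per key and divide once at the end; the sample pairs below are hypothetical, but combineByKey and mapValues are standard pair-RDD operations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-by-key").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) sample data.
pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)])

# Accumulate (sum, count) per key, then divide once per key.
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                            # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # mergeCombiners
)
averages = sum_count.mapValues(lambda s: s[0] / s[1])

print(averages.collectAsMap())   # {'a': 2.0, 'b': 20.0}
```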
38
votes
2 answers

How does Spark read a large file (petabytes) when the file cannot fit in Spark's main memory

What will happen with large files in these cases? 1) Spark gets the data's location from the NameNode. Will Spark stop at this point because the data size is too large, based on the information from the NameNode? 2) Spark partitions the data as per the DataNode block…
Arpit Rai
  • 391
  • 1
  • 4
  • 5
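As a rough illustration of the point behind the question above: Spark does not load the whole file into memory; textFile builds an RDD whose partitions map to the input splits, and each partition is processed as a task. The path and partition count below are placeholders, not real values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-file").getOrCreate()
sc = spark.sparkContext

# Placeholder path: partitions follow the HDFS block/split layout, and each
# partition is streamed through an executor task rather than held all at once.
lines = sc.textFile("hdfs:///data/very_large_file.txt", minPartitions=1000)

print(lines.getNumPartitions())                     # one task per partition
print(lines.filter(lambda l: "ERROR" in l).count()) # evaluated lazily, partition by partition
```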
37
votes
2 answers

How to convert Spark RDD to pandas dataframe in ipython?

I have an RDD and I want to convert it to a pandas dataframe. I know that to convert an RDD to a normal dataframe we can do df = rdd1.toDF(), but I want to convert the RDD to a pandas dataframe and not a normal dataframe. How can I do it?
user2966197
  • 2,793
  • 10
  • 45
  • 77
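A sketch of two usual routes for the question above, assuming a small hypothetical RDD of tuples (the column names are made up): go through a Spark DataFrame and call toPandas, or collect to the driver and build the pandas frame directly.

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("rdd-to-pandas").getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD of records.
rdd1 = sc.parallelize([("alice", 1), ("bob", 2)])

# Route 1: Spark DataFrame first, then hand it to pandas.
pdf = rdd1.toDF(["name", "value"]).toPandas()

# Route 2: collect to the driver and build the pandas DataFrame directly
# (only safe when the data fits in driver memory).
pdf2 = pd.DataFrame(rdd1.collect(), columns=["name", "value"])

print(pdf.head())
```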
36
votes
2 answers

How to extract an element from an array in pyspark

I have a data frame of the following type: col1|col2|col3|col4 xxxx|yyyy|zzzz|[1111],[2222] I want my output to be of the following type: col1|col2|col3|col4|col5 xxxx|yyyy|zzzz|1111|2222 My col4 is an array and I want to convert it into a separate column.…
AnmolDave
  • 395
  • 2
  • 4
  • 6
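For the array question above, a minimal DataFrame sketch (the sample row mirrors the shapes in the excerpt and is hypothetical): Column.getItem pulls individual array elements into their own columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array-to-columns").getOrCreate()

# Hypothetical frame with col4 as a two-element array.
df = spark.createDataFrame(
    [("xxxx", "yyyy", "zzzz", [1111, 2222])],
    ["col1", "col2", "col3", "col4"],
)

# col5 takes the second element; col4 is then replaced by the first element.
result = (df
          .withColumn("col5", F.col("col4").getItem(1))
          .withColumn("col4", F.col("col4").getItem(0)))

result.show()
```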
36
votes
2 answers

Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate the datasets of two different RDDs in Spark? The requirement is: I create two intermediate RDDs using Scala which have the same column names, and I need to combine the results of both RDDs and cache the result for access from the UI.…
Atom
  • 768
  • 1
  • 15
  • 35
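The question above is asked for Scala, but the same calls exist in PySpark; a hedged sketch with made-up data: union concatenates the partitions of both RDDs without deduplication or a shuffle, and cache keeps the combined result in memory for repeated access.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-rdds").getOrCreate()
sc = spark.sparkContext

# Two hypothetical intermediate RDDs with the same record shape.
rdd_a = sc.parallelize([("k1", 1), ("k2", 2)])
rdd_b = sc.parallelize([("k3", 3), ("k4", 4)])

# union simply appends the partitions of one RDD to the other's.
combined = rdd_a.union(rdd_b).cache()

print(combined.count())     # 4
print(combined.collect())
```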
35
votes
2 answers

Pyspark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy. Here is some sample…
Joe Widen
  • 2,378
  • 1
  • 15
  • 21
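A small sketch of the distinction raised in the question above, using a throwaway pair RDD: repartition reshuffles to a given number of partitions with no regard for keys, while partitionBy (for key/value RDDs) hashes the key so that all records with the same key land in the same partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-partitionby").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 2)

# repartition(n): exactly n partitions, records spread roughly evenly.
evenly_spread = pairs.repartition(4)

# partitionBy(n): key-aware; same key -> same partition (useful before joins/lookups).
by_key = pairs.partitionBy(4)

print(evenly_spread.getNumPartitions(), by_key.getNumPartitions())
print(by_key.glom().collect())   # inspect which keys ended up together
```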
34
votes
3 answers

How to sort an RDD in Scala Spark?

Reading the Spark method sortByKey: sortByKey([ascending], [numTasks]). When called on a dataset of (K, V) pairs where K implements Ordered, it returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the…
blue-sky
  • 51,962
  • 152
  • 427
  • 752
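The question above is about the Scala API, but the same operators exist in PySpark; a minimal sketch with invented pairs: sortByKey orders a (K, V) RDD by key, and sortBy takes an arbitrary key function for any RDD.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-rdd").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])

ascending = pairs.sortByKey()                  # [('a', 1), ('b', 2), ('c', 3)]
descending = pairs.sortByKey(ascending=False)  # reverse key order

# sortBy works on any RDD and sorts by a key function (here the value).
by_value = pairs.sortBy(lambda kv: kv[1], ascending=False)

print(ascending.collect(), descending.collect(), by_value.collect())
```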
33
votes
2 answers

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I…
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
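A short sketch of the explicit route discussed in the question above (toy data, nothing cluster-specific assumed): Spark can evict cached blocks and drops them when the RDD is garbage-collected on the driver, but calling unpersist yourself releases the memory at a predictable point.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-unpersist").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2).persist()

total = rdd.sum()     # first action materializes and caches the RDD
count = rdd.count()   # second action reuses the cached partitions

# Explicitly release the cached blocks once the RDD is no longer needed.
rdd.unpersist()
```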
31
votes
3 answers

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an…
smli
  • 345
  • 1
  • 4
  • 6
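The question above targets Spark 1.0.1 and the RDD API; on current Spark versions the tag wiki's advice applies and the DataFrame writer is the simplest way to get one output directory per key. A hedged sketch with made-up rows and a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-by-key").getOrCreate()
sc = spark.sparkContext

# Hypothetical rows of (id, payload), with duplicate ids.
rows = sc.parallelize([(1, "a"), (1, "b"), (2, "c")])

# partitionBy on the writer creates one directory per distinct id
# (id=1/, id=2/, ...) under the placeholder output path.
df = rows.toDF(["id", "payload"])
df.write.mode("overwrite").partitionBy("id").csv("/tmp/output_by_id")
```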
29
votes
3 answers

How to get element by Index in Spark RDD (Java)

I know the method rdd.first(), which gives me the first element in an RDD. Also there is the method rdd.take(num), which gives me the first "num" elements. But isn't there a possibility to get an element by index? Thanks.
progNewbie
  • 4,362
  • 9
  • 48
  • 107
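The question above is asked for Java, where the same operators exist; a PySpark sketch of one common workaround (sample data and the target index are made up): there is no direct index accessor on an RDD, so pair each element with its index via zipWithIndex and filter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("element-by-index").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["a", "b", "c", "d"], 2)

target = 2
element = (rdd.zipWithIndex()                  # (value, index) pairs
              .filter(lambda vi: vi[1] == target)
              .map(lambda vi: vi[0])
              .first())

print(element)   # 'c'
```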
29
votes
3 answers

Convert a simple one line string to RDD in Spark

I have a simple line: line = "Hello, world" I would like to convert it to an RDD with only one element. I have tried sc.parallelize(line) but I get: sc.parallelize(line).collect() ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l',…
poiuytrez
  • 21,330
  • 35
  • 113
  • 172
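The behavior in the question above comes from parallelize treating a string as a collection of characters; a one-line sketch of the usual fix is to wrap the string in a single-element list.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-line-rdd").getOrCreate()
sc = spark.sparkContext

line = "Hello, world"

# A bare string is iterated character by character; a one-element list
# yields a one-element RDD.
rdd = sc.parallelize([line])

print(rdd.collect())   # ['Hello, world']
```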
28
votes
2 answers

Apache Spark dealing with case statements

I am transforming SQL code into PySpark code and came across some SQL statements. I don't know how to approach case statements in PySpark. I am planning on creating an RDD, then using rdd.map and doing some logic checks. Is that the…
Amardeep Flora
  • 1,255
  • 6
  • 13
  • 29
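For the question above, a sketch using the DataFrame API rather than rdd.map (the sample table and grading thresholds are invented): when/otherwise is the DataFrame counterpart of a SQL CASE expression.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("case-when").getOrCreate()

# Hypothetical frame standing in for the SQL table being translated.
df = spark.createDataFrame([(1, 95), (2, 60), (3, 30)], ["id", "score"])

# Equivalent of: CASE WHEN score >= 90 THEN 'high'
#                     WHEN score >= 50 THEN 'medium'
#                     ELSE 'low' END
graded = df.withColumn(
    "grade",
    F.when(F.col("score") >= 90, "high")
     .when(F.col("score") >= 50, "medium")
     .otherwise("low"),
)

graded.show()
```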
28
votes
3 answers

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

I have the following spark job, trying to keep everything in memory: val myOutRDD = myInRDD.flatMap { fp => val tuple2List: ListBuffer[(String, myClass)] = ListBuffer() : tuple2List }.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1,…
Edamame
  • 23,718
  • 73
  • 186
  • 320
28
votes
2 answers

Spark: Efficient way to test if an RDD is empty

There is no isEmpty method on RDDs, so what is the most efficient way of testing whether an RDD is empty?
Tobber
  • 7,211
  • 8
  • 33
  • 56
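The question above predates RDD.isEmpty; current Spark releases ship it, and the older manual check only pulls at most one element to the driver instead of counting everything. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("is-empty").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([])

print(rdd.isEmpty())            # True; available on modern Spark versions
print(len(rdd.take(1)) == 0)    # same idea without isEmpty(): fetch at most one element
```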
28
votes
4 answers

Join two ordinary RDDs with/without Spark SQL

I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to a database join of two tables. I wonder whether this is possible only through Spark SQL or whether there are other ways of doing it. As a concrete…
learning_spark
  • 669
  • 1
  • 8
  • 19
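A sketch of the non-SQL route for the question above, assuming two hypothetical RDDs already keyed by the join column: plain pair RDDs join directly, producing (key, (left_value, right_value)) pairs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-rdds").getOrCreate()
sc = spark.sparkContext

# Hypothetical tables, keyed by the join column.
users = sc.parallelize([(1, "alice"), (2, "bob")])
orders = sc.parallelize([(1, "book"), (1, "pen"), (2, "lamp")])

# RDD join matches on the key; no Spark SQL required. Output order of
# collect() is not guaranteed.
joined = users.join(orders)
print(joined.collect())
# e.g. [(1, ('alice', 'book')), (1, ('alice', 'pen')), (2, ('bob', 'lamp'))]
```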