Questions tagged [rdd]

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the core of Spark (which I often refer to as "Spark Core").

Warning: Please note that the RDD API is a very low-level construct and is not recommended for use in modern versions of Apache Spark. Please use the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute partitions that go missing or are damaged due to node failures.

Distributed, with data residing on multiple nodes in a cluster.

Dataset, i.e. a collection of partitioned data holding primitive values or composite values, e.g. tuples or other objects that represent the records of the data you work with.
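A minimal PySpark sketch of these three properties (the variable names and sample data are illustrative, not taken from the tag wiki): the RDD is partitioned across the cluster, each transformation produces a new read-only RDD, and the lineage that Spark would use to recompute lost partitions can be inspected with toDebugString.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Built from a source (here a local collection) and transformed through bulk
# operations; every step is recorded in the lineage graph.
numbers = sc.parallelize(range(10), numSlices=4)   # distributed across 4 partitions
squares = numbers.map(lambda x: x * x)             # read-only: map returns a new RDD
evens = squares.filter(lambda x: x % 2 == 0)

print(evens.collect())         # [0, 4, 16, 36, 64]
print(evens.toDebugString())   # the lineage used to recompute missing partitions
```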

For more information:

  1. Mastering-Apache-Spark: RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
39
votes
4 answers

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

I want to share this particular Apache Spark with Python solution because documentation for it is quite poor. I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY. Here is what the sample data looks like: >>>…
NYCeyes
  • 5,215
  • 6
  • 57
  • 64
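For the question above on per-key averages, a common pattern is to carry a (sum, count) pair per key and divide once at the end; the sample pairs below are hypothetical, but combineByKey and mapValues are standard pair-RDD operations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg-by-key").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) sample data.
pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)])

# Accumulate (sum, count) per key, then divide once per key.
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                            # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),     # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # mergeCombiners
)
averages = sum_count.mapValues(lambda s: s[0] / s[1])

print(averages.collectAsMap())   # {'a': 2.0, 'b': 20.0}
```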
38
votes
2 answers

How does Spark read a large file (petabytes) when the file cannot fit in Spark's main memory

What will happen with large files in these cases? 1) Spark gets the data's location from the NameNode. Will Spark stop at this point because the data size is too large, based on the information from the NameNode? 2) Spark partitions the data as per the DataNode block…
Arpit Rai
  • 391
  • 1
  • 4
  • 5
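As a rough illustration of the point behind the question above: Spark does not load the whole file into memory; textFile builds an RDD whose partitions map to the input splits, and each partition is processed as a task. The path and partition count below are placeholders, not real values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-file").getOrCreate()
sc = spark.sparkContext

# Placeholder path: partitions follow the HDFS block/split layout, and each
# partition is streamed through an executor task rather than held all at once.
lines = sc.textFile("hdfs:///data/very_large_file.txt", minPartitions=1000)

print(lines.getNumPartitions())                     # one task per partition
print(lines.filter(lambda l: "ERROR" in l).count()) # evaluated lazily, partition by partition
```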
37
votes
2 answers

How to convert Spark RDD to pandas dataframe in ipython?

I have an RDD and I want to convert it to a pandas dataframe. I know that to convert an RDD to a normal dataframe we can do df = rdd1.toDF(), but I want to convert the RDD to a pandas dataframe and not a normal dataframe. How can I do it?
user2966197
  • 2,793
  • 10
  • 45
  • 77
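A sketch of two usual routes for the question above, assuming a small hypothetical RDD of tuples (the column names are made up): go through a Spark DataFrame and call toPandas, or collect to the driver and build the pandas frame directly.

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("rdd-to-pandas").getOrCreate()
sc = spark.sparkContext

# Hypothetical RDD of records.
rdd1 = sc.parallelize([("alice", 1), ("bob", 2)])

# Route 1: Spark DataFrame first, then hand it to pandas.
pdf = rdd1.toDF(["name", "value"]).toPandas()

# Route 2: collect to the driver and build the pandas DataFrame directly
# (only safe when the data fits in driver memory).
pdf2 = pd.DataFrame(rdd1.collect(), columns=["name", "value"])

print(pdf.head())
```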
36
votes
2 answers

How to extract an element from an array in pyspark

I have a data frame of the following type: col1|col2|col3|col4 xxxx|yyyy|zzzz|[1111],[2222] I want my output to be of the following type: col1|col2|col3|col4|col5 xxxx|yyyy|zzzz|1111|2222 My col4 is an array and I want to convert it into a separate column.…
AnmolDave
  • 395
  • 2
  • 4
  • 6
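For the array question above, a minimal DataFrame sketch (the sample row mirrors the shapes in the excerpt and is hypothetical): Column.getItem pulls individual array elements into their own columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array-to-columns").getOrCreate()

# Hypothetical frame with col4 as a two-element array.
df = spark.createDataFrame(
    [("xxxx", "yyyy", "zzzz", [1111, 2222])],
    ["col1", "col2", "col3", "col4"],
)

# col5 takes the second element; col4 is then replaced by the first element.
result = (df
          .withColumn("col5", F.col("col4").getItem(1))
          .withColumn("col4", F.col("col4").getItem(0)))

result.show()
```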
36
votes
2 answers

Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate the datasets of two different RDDs in Spark? The requirement is: I create two intermediate RDDs using Scala which have the same column names, and I need to combine the results of both RDDs and cache the result for access from the UI.…
Atom
  • 768
  • 1
  • 15
  • 35
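The question above is asked for Scala, but the same calls exist in PySpark; a hedged sketch with made-up data: union concatenates the partitions of both RDDs without deduplication or a shuffle, and cache keeps the combined result in memory for repeated access.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-rdds").getOrCreate()
sc = spark.sparkContext

# Two hypothetical intermediate RDDs with the same record shape.
rdd_a = sc.parallelize([("k1", 1), ("k2", 2)])
rdd_b = sc.parallelize([("k3", 3), ("k4", 4)])

# union simply appends the partitions of one RDD to the other's.
combined = rdd_a.union(rdd_b).cache()

print(combined.count())     # 4
print(combined.collect())
```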
35
votes
2 answers

Pyspark: repartition vs partitionBy

I'm working through these two concepts right now and would like some clarity. From working through the command line, I've been trying to identify the differences and when a developer would use repartition vs partitionBy. Here is some sample…
Joe Widen
  • 2,378
  • 1
  • 15
  • 21
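A small sketch of the distinction raised in the question above, using a throwaway pair RDD: repartition reshuffles to a given number of partitions with no regard for keys, while partitionBy (for key/value RDDs) hashes the key so that all records with the same key land in the same partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-partitionby").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], 2)

# repartition(n): exactly n partitions, records spread roughly evenly.
evenly_spread = pairs.repartition(4)

# partitionBy(n): key-aware; same key -> same partition (useful before joins/lookups).
by_key = pairs.partitionBy(4)

print(evenly_spread.getNumPartitions(), by_key.getNumPartitions())
print(by_key.glom().collect())   # inspect which keys ended up together
```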
34
votes
3 answers

How to sort an RDD in Scala Spark?

Reading the Spark method sortByKey: sortByKey([ascending], [numTasks]). When called on a dataset of (K, V) pairs where K implements Ordered, it returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the…
blue-sky
  • 51,962
  • 152
  • 427
  • 752
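The question above is about the Scala API, but the same operators exist in PySpark; a minimal sketch with invented pairs: sortByKey orders a (K, V) RDD by key, and sortBy takes an arbitrary key function for any RDD.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-rdd").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])

ascending = pairs.sortByKey()                  # [('a', 1), ('b', 2), ('c', 3)]
descending = pairs.sortByKey(ascending=False)  # reverse key order

# sortBy works on any RDD and sorts by a key function (here the value).
by_value = pairs.sortBy(lambda kv: kv[1], ascending=False)

print(ascending.collect(), descending.collect(), by_value.collect())
```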
33
votes
2 answers

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

We can persist an RDD into memory and/or disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I…
MetallicPriest
  • 29,191
  • 52
  • 200
  • 356
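A short sketch of the explicit route discussed in the question above (toy data, nothing cluster-specific assumed): Spark can evict cached blocks and drops them when the RDD is garbage-collected on the driver, but calling unpersist yourself releases the memory at a predictable point.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-unpersist").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2).persist()

total = rdd.sum()     # first action materializes and caches the RDD
count = rdd.count()   # second action reuses the cached partitions

# Explicitly release the cached blocks once the RDD is no longer needed.
rdd.unpersist()
```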
31
votes
3 answers

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an…
smli
  • 345
  • 1
  • 4
  • 6
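The question above targets Spark 1.0.1 and the RDD API; on current Spark versions the tag wiki's advice applies and the DataFrame writer is the simplest way to get one output directory per key. A hedged sketch with made-up rows and a placeholder output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-by-key").getOrCreate()
sc = spark.sparkContext

# Hypothetical rows of (id, payload), with duplicate ids.
rows = sc.parallelize([(1, "a"), (1, "b"), (2, "c")])

# partitionBy on the writer creates one directory per distinct id
# (id=1/, id=2/, ...) under the placeholder output path.
df = rows.toDF(["id", "payload"])
df.write.mode("overwrite").partitionBy("id").csv("/tmp/output_by_id")
```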
29
votes
3 answers

How to get element by Index in Spark RDD (Java)

I know the method rdd.first(), which gives me the first element in an RDD. Also there is the method rdd.take(num), which gives me the first "num" elements. But isn't there a possibility to get an element by index? Thanks.
progNewbie
  • 4,362
  • 9
  • 48
  • 107
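The question above is asked for Java, where the same operators exist; a PySpark sketch of one common workaround (sample data and the target index are made up): there is no direct index accessor on an RDD, so pair each element with its index via zipWithIndex and filter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("element-by-index").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["a", "b", "c", "d"], 2)

target = 2
element = (rdd.zipWithIndex()                  # (value, index) pairs
              .filter(lambda vi: vi[1] == target)
              .map(lambda vi: vi[0])
              .first())

print(element)   # 'c'
```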
29
votes
3 answers

Convert a simple one line string to RDD in Spark

I have a simple line: line = "Hello, world" I would like to convert it to an RDD with only one element. I have tried sc.parallelize(line) but I get: sc.parallelize(line).collect() ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l',…
poiuytrez
  • 21,330
  • 35
  • 113
  • 172
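The behavior in the question above comes from parallelize treating a string as a collection of characters; a one-line sketch of the usual fix is to wrap the string in a single-element list.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-line-rdd").getOrCreate()
sc = spark.sparkContext

line = "Hello, world"

# A bare string is iterated character by character; a one-element list
# yields a one-element RDD.
rdd = sc.parallelize([line])

print(rdd.collect())   # ['Hello, world']
```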
28
votes
2 answers

Apache Spark dealing with case statements

I am transforming SQL code into PySpark code and came across some SQL statements. I don't know how to approach case statements in PySpark. I am planning on creating an RDD, then using rdd.map and doing some logic checks. Is that the…
Amardeep Flora
  • 1,255
  • 6
  • 13
  • 29
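For the question above, a sketch using the DataFrame API rather than rdd.map (the sample table and grading thresholds are invented): when/otherwise is the DataFrame counterpart of a SQL CASE expression.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("case-when").getOrCreate()

# Hypothetical frame standing in for the SQL table being translated.
df = spark.createDataFrame([(1, 95), (2, 60), (3, 30)], ["id", "score"])

# Equivalent of: CASE WHEN score >= 90 THEN 'high'
#                     WHEN score >= 50 THEN 'medium'
#                     ELSE 'low' END
graded = df.withColumn(
    "grade",
    F.when(F.col("score") >= 90, "high")
     .when(F.col("score") >= 50, "medium")
     .otherwise("low"),
)

graded.show()
```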
28
votes
3 answers

Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

I have the following spark job, trying to keep everything in memory: val myOutRDD = myInRDD.flatMap { fp => val tuple2List: ListBuffer[(String, myClass)] = ListBuffer() : tuple2List }.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1,…
Edamame
  • 23,718
  • 73
  • 186
  • 320
28
votes
2 answers

Spark: Efficient way to test if an RDD is empty

There is no isEmpty method on RDDs, so what is the most efficient way of testing whether an RDD is empty?
Tobber
  • 7,211
  • 8
  • 33
  • 56
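The question above predates RDD.isEmpty; current Spark releases ship it, and the older manual check only pulls at most one element to the driver instead of counting everything. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("is-empty").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([])

print(rdd.isEmpty())            # True; available on modern Spark versions
print(len(rdd.take(1)) == 0)    # same idea without isEmpty(): fetch at most one element
```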
28
votes
4 answers

Join two ordinary RDDs with/without Spark SQL

I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to a database join of two tables. I wonder whether this is possible only through Spark SQL or whether there are other ways of doing it. As a concrete…
learning_spark
  • 669
  • 1
  • 8
  • 19
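A sketch of the non-SQL route for the question above, assuming two hypothetical RDDs already keyed by the join column: plain pair RDDs join directly, producing (key, (left_value, right_value)) pairs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-rdds").getOrCreate()
sc = spark.sparkContext

# Hypothetical tables, keyed by the join column.
users = sc.parallelize([(1, "alice"), (2, "bob")])
orders = sc.parallelize([(1, "book"), (1, "pen"), (2, "lamp")])

# RDD join matches on the key; no Spark SQL required. Output order of
# collect() is not guaranteed.
joined = users.join(orders)
print(joined.collect())
# e.g. [(1, ('alice', 'book')), (1, ('alice', 'pen')), (2, ('bob', 'lamp'))]
```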