Questions tagged [rdd]

Resilient Distributed Datasets (a.k.a. RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the foundation of Spark Core.

Warning: The RDD API is a very low-level construct and is not recommended for new code in modern versions of Apache Spark. Prefer the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute missing or damaged partitions caused by node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset, i.e. a collection of partitioned data holding primitive values or compound values such as tuples or other objects that represent the records of the data you work with.
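For example, a minimal PySpark sketch (names and values are illustrative) showing an RDD being built, transformed lazily, and converted to the recommended DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # An RDD is built from existing data and split into partitions across the cluster.
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

    # Transformations are lazy and return a new (immutable) RDD;
    # the lineage graph lets Spark recompute lost partitions.
    counts = rdd.reduceByKey(lambda x, y: x + y)

    # Moving to the recommended DataFrame API is a one-liner.
    df = counts.toDF(["key", "total"])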

For more information:

  1. Mastering-Apache-Spark : RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
59
votes
5 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
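One commonly suggested approach, sketched here with a toy DataFrame and an illustrative output path: repartition by the partition column before writing, so each key ends up in a single file rather than one file per task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"])

    # Repartitioning by the partition column shuffles all rows with the same key
    # into one in-memory partition, so each output directory gets a single file.
    data.repartition("key").write.partitionBy("key").parquet("/tmp/location")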
58
votes
2 answers

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util…
Frederico Oliveira
  • 2,283
  • 3
  • 14
  • 10
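For reference, toDF is only attached to RDDs once a SQLContext/SparkSession has been created; a minimal sketch (column names and data are illustrative, using the modern SparkSession entry point rather than Spark 1.5's SQLContext):

    from pyspark.sql import SparkSession

    # Instantiating the session (or a SQLContext) is what patches toDF onto
    # RDDs; without it, a PipelinedRDD has no such attribute.
    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize([(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])])
    df = rdd.toDF(["label", "features"])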
56
votes
9 answers

Explain the aggregate functionality in Spark (with Python and Scala)

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example I have is as follows (using pyspark from Spark 1.2.0 version) sc.parallelize([1,2,3,4]).aggregate( (0, 0), (lambda acc,…
ab_tech_sp
  • 943
  • 2
  • 9
  • 7
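The usual reading of that call, sketched below with invented helper names: the zero value seeds a (sum, count) accumulator, the seqOp folds each element into its partition's accumulator, and the combOp merges accumulators across partitions.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    zero = (0, 0)                                      # (running sum, running count)
    seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # applied within each partition
    comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merges per-partition results

    total, count = sc.parallelize([1, 2, 3, 4]).aggregate(zero, seq_op, comb_op)
    # total == 10, count == 4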
52
votes
9 answers

Spark specify multiple column conditions for dataframe join

How do I specify additional column conditions when joining two dataframes? For example, I want to run the following: val Lead_all = Leads.join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == …
user568109
  • 47,225
  • 17
  • 99
  • 123
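A PySpark sketch of the usual equi-join form (table and column names taken from the question, data invented): passing a list of column names joins on all of them at once.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    cols = ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"]

    leads = spark.createDataFrame([("web", "g", "cpc", "x", 1)], cols + ["lead_id"])
    utm_master = spark.createDataFrame([("web", "g", "cpc", "x", 9)], cols + ["utm_id"])

    # Equi-join on several columns at once; for non-equality conditions,
    # combine Column expressions with & instead.
    lead_all = leads.join(utm_master, cols)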
52
votes
14 answers

Spark read file from S3 using sc.textFile ("s3n://...)

Trying to read a file located in S3 using spark-shell: scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log") lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at :12 scala>…
Polymerase
  • 6,311
  • 11
  • 47
  • 65
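A hedged PySpark sketch of the usual setup (bucket, path, and credentials are placeholders; the s3a:// connector also needs matching hadoop-aws jars on the classpath):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hadoop configuration carries the S3 credentials; the keys shown are for
    # the s3a filesystem, which replaces the older s3n scheme.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    my_rdd = sc.textFile("s3a://myBucket/myFile1.log")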
52
votes
10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
TravisJ
  • 1,592
  • 1
  • 21
  • 37
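A minimal sketch of the usual answer (keys and values invented): reduceByKey is the wrong fit because the value type changes from V to [V]; groupByKey (or aggregateByKey, to control memory) does this directly.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3)])

    # groupByKey gathers every value for a key into one iterable;
    # mapValues(list) turns it into a plain Python list.
    key_lists = pairs.groupByKey().mapValues(list)
    # key_lists.collect() -> [("K", [1, 2, 3])]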
51
votes
3 answers

Number of partitions in RDD and performance in Spark

In Pyspark, I can create a RDD from a list and decide how many partitions to have: sc = SparkContext() sc.parallelize(xrange(0, 10), 4) How does the number of partitions I choose for my RDD influence performance? And how does this…
mar tin
  • 9,266
  • 23
  • 72
  • 97
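A small sketch for experimenting with this (values illustrative): each partition becomes one task, so too few partitions underuse the cluster while too many add scheduling overhead.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(0, 10), 4)   # explicitly request 4 partitions

    print(rdd.getNumPartitions())           # 4
    print(rdd.glom().collect())             # inspect which elements each partition holds
    rdd2 = rdd.repartition(8)               # reshuffle into more partitions if needed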
50
votes
3 answers

How to find spark RDD/Dataframe size?

I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark? Scala: object Main extends App { val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString() println(file.length) } Spark: val distFile…
Venu A Positive
  • 2,992
  • 2
  • 28
  • 31
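One commonly cited approach, sketched here with heavy caveats (the internal _jsc handle and the storage-info accessors are not a stable public Python API): cache the data, materialize it, and read the in-memory size from the block manager (also visible in the Storage tab of the Spark UI).

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(100000)).cache()
    rdd.count()                              # materialize so it is actually stored

    # Size as tracked by the block manager for each cached RDD.
    for info in sc._jsc.sc().getRDDStorageInfo():
        print(info.id(), info.memSize(), info.diskSize())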
49
votes
0 answers

Difference between DataSet API and DataFrame API

Can anyone help me understand the difference between the DataSet API and the DataFrame API with an example? Why was there a need to introduce the DataSet API in Spark?
Shashi
  • 2,686
  • 7
  • 35
  • 67
48
votes
10 answers

What is RDD in spark

The definition says: RDD is an immutable distributed collection of objects. I don't quite understand what that means. Is it like data (partitioned objects) stored on a hard disk? If so, then how come RDDs can have user-defined classes (such as java,…
kittu
  • 6,662
  • 21
  • 91
  • 185
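A tiny sketch of what that definition means in practice (data invented): elements are ordinary objects held in partitions across the executors, nothing is computed or stored until an action runs, and transformations never modify an RDD, they derive a new one.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # An RDD is a partitioned, immutable collection of ordinary objects,
    # materialized in executor memory only when an action runs.
    rdd = sc.parallelize([("alice", 1), ("bob", 2)], 2)   # 2 partitions
    upper = rdd.map(lambda kv: (kv[0].upper(), kv[1]))    # new RDD; 'rdd' itself never changes
    print(upper.collect())                                # the action triggers the computation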
47
votes
4 answers

How to read from hbase using spark

The code below reads from HBase, converts it to a JSON structure and then converts that to a schemaRDD. The problem is that I am using a List to store the JSON strings and then passing it to javaRDD; for data of about 100 GB the master will be loaded with…
madan ram
  • 1,260
  • 2
  • 19
  • 26
43
votes
3 answers

Spark union of multiple RDDs

In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise: first =…
user3803714
  • 5,269
  • 10
  • 42
  • 61
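A small sketch of the usual workaround (relation contents invented): SparkContext.union accepts a whole list, so there is no need to chain pairwise unions.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    relations = [sc.parallelize(range(i, i + 3)) for i in range(0, 18, 3)]   # six RDDs

    # One call over the whole list; functools.reduce over rdd.union would also work.
    all_combined = sc.union(relations)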
41
votes
4 answers

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD. If you're familiar with SAS, something like this: data work.split1, work.split2; …
Carlos Bribiescas
  • 4,197
  • 9
  • 35
  • 66
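A minimal sketch of the two standard options (predicate and split weights invented): there is no single transformation that returns several RDDs, so either filter once per output or use randomSplit for a random split.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(10)).cache()      # cache so each pass reuses the same data

    evens = rdd.filter(lambda x: x % 2 == 0)     # one filter per desired output RDD
    odds = rdd.filter(lambda x: x % 2 != 0)

    part_a, part_b = rdd.randomSplit([0.5, 0.5], seed=42)   # random, weight-based split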
40
votes
5 answers

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts using one single entry SparkSession? Are…
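A short sketch of how these relate in modern Spark (app name invented): SparkSession is the single entry point, and the older contexts hang off it rather than being constructed separately.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    sc = spark.sparkContext       # the underlying SparkContext (JavaSparkContext is its Java wrapper)
    df = spark.range(5)           # DataFrame functionality that used to live on SQLContext
    spark.sql("SELECT 1").show()  # SQLContext-style SQL is also available on the session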
39
votes
1 answer

Spark RDD - Mapping with extra arguments

Is it possible to pass extra arguments to the mapping function in pySpark? Specifically, I have the following code recipe: raw_data_rdd = sc.textFile("data.json", use_unicode=True) json_data_rdd = raw_data_rdd.map(lambda line:…
Stan
  • 1,042
  • 2
  • 13
  • 29
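A minimal sketch of the usual pattern (function and argument names invented): close over the extra argument with a lambda, or pre-bind it with functools.partial.

    from functools import partial
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def parse_line(line, separator):          # 'separator' is the extra argument
        return line.split(separator)

    raw = sc.parallelize(["a,b", "c,d"])
    via_lambda = raw.map(lambda line: parse_line(line, ","))     # closure captures the argument
    via_partial = raw.map(partial(parse_line, separator=","))    # same thing, pre-bound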