Questions tagged [rdd]

Resilient Distributed Datasets (a.k.a. RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the foundation of Spark Core.

Warning: The RDD API is a very low-level construct and is not recommended for new code in modern versions of Apache Spark. Prefer the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute missing or damaged partitions caused by node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset, i.e. a collection of partitioned data holding primitive values or compound values such as tuples or other objects that represent the records of the data you work with.
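For example, a minimal PySpark sketch (names and values are illustrative) showing an RDD being built, transformed lazily, and converted to the recommended DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # An RDD is built from existing data and split into partitions across the cluster.
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

    # Transformations are lazy and return a new (immutable) RDD;
    # the lineage graph lets Spark recompute lost partitions.
    counts = rdd.reduceByKey(lambda x, y: x + y)

    # Moving to the recommended DataFrame API is a one-liner.
    df = counts.toDF(["key", "total"])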

For more information:

  1. Mastering-Apache-Spark : RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
59
votes
5 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
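One commonly suggested approach, sketched here with a toy DataFrame and an illustrative output path: repartition by the partition column before writing, so each key ends up in a single file rather than one file per task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"])

    # Repartitioning by the partition column shuffles all rows with the same key
    # into one in-memory partition, so each output directory gets a single file.
    data.repartition("key").write.partitionBy("key").parquet("/tmp/location")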
58
votes
2 answers

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util…
Frederico Oliveira
  • 2,283
  • 3
  • 14
  • 10
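For reference, toDF is only attached to RDDs once a SQLContext/SparkSession has been created; a minimal sketch (column names and data are illustrative, using the modern SparkSession entry point rather than Spark 1.5's SQLContext):

    from pyspark.sql import SparkSession

    # Instantiating the session (or a SQLContext) is what patches toDF onto
    # RDDs; without it, a PipelinedRDD has no such attribute.
    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize([(0.0, [1.0, 2.0]), (1.0, [3.0, 4.0])])
    df = rdd.toDF(["label", "features"])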
56
votes
9 answers

Explain the aggregate functionality in Spark (with Python and Scala)

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example I have is as follows (using pyspark from Spark 1.2.0 version) sc.parallelize([1,2,3,4]).aggregate( (0, 0), (lambda acc,…
ab_tech_sp
  • 943
  • 2
  • 9
  • 7
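The usual reading of that call, sketched below with invented helper names: the zero value seeds a (sum, count) accumulator, the seqOp folds each element into its partition's accumulator, and the combOp merges accumulators across partitions.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    zero = (0, 0)                                      # (running sum, running count)
    seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # applied within each partition
    comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merges per-partition results

    total, count = sc.parallelize([1, 2, 3, 4]).aggregate(zero, seq_op, comb_op)
    # total == 10, count == 4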
52
votes
9 answers

Spark specify multiple column conditions for dataframe join

How do I specify additional column conditions when joining two dataframes? For example, I want to run the following: val Lead_all = Leads.join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == …
user568109
  • 47,225
  • 17
  • 99
  • 123
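A PySpark sketch of the usual equi-join form (table and column names taken from the question, data invented): passing a list of column names joins on all of them at once.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    cols = ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"]

    leads = spark.createDataFrame([("web", "g", "cpc", "x", 1)], cols + ["lead_id"])
    utm_master = spark.createDataFrame([("web", "g", "cpc", "x", 9)], cols + ["utm_id"])

    # Equi-join on several columns at once; for non-equality conditions,
    # combine Column expressions with & instead.
    lead_all = leads.join(utm_master, cols)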
52
votes
14 answers

Spark read file from S3 using sc.textFile ("s3n://...)

Trying to read a file located in S3 using spark-shell: scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log") lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at :12 scala>…
Polymerase
  • 6,311
  • 11
  • 47
  • 65
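A hedged PySpark sketch of the usual setup (bucket, path, and credentials are placeholders; the s3a:// connector also needs matching hadoop-aws jars on the classpath):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hadoop configuration carries the S3 credentials; the keys shown are for
    # the s3a filesystem, which replaces the older s3n scheme.
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    my_rdd = sc.textFile("s3a://myBucket/myFile1.log")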
52
votes
10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
TravisJ
  • 1,592
  • 1
  • 21
  • 37
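A minimal sketch of the usual answer (keys and values invented): reduceByKey is the wrong fit because the value type changes from V to [V]; groupByKey (or aggregateByKey, to control memory) does this directly.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3)])

    # groupByKey gathers every value for a key into one iterable;
    # mapValues(list) turns it into a plain Python list.
    key_lists = pairs.groupByKey().mapValues(list)
    # key_lists.collect() -> [("K", [1, 2, 3])]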
51
votes
3 answers

Number of partitions in RDD and performance in Spark

In Pyspark, I can create a RDD from a list and decide how many partitions to have: sc = SparkContext() sc.parallelize(xrange(0, 10), 4) How does the number of partitions I choose for my RDD influence performance? And how does this…
mar tin
  • 9,266
  • 23
  • 72
  • 97
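A small sketch for experimenting with this (values illustrative): each partition becomes one task, so too few partitions underuse the cluster while too many add scheduling overhead.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(0, 10), 4)   # explicitly request 4 partitions

    print(rdd.getNumPartitions())           # 4
    print(rdd.glom().collect())             # inspect which elements each partition holds
    rdd2 = rdd.repartition(8)               # reshuffle into more partitions if needed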
50
votes
3 answers

How to find spark RDD/Dataframe size?

I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark? Scala: object Main extends App { val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString() println(file.length) } Spark: val distFile…
Venu A Positive
  • 2,992
  • 2
  • 28
  • 31
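One commonly cited approach, sketched here with heavy caveats (the internal _jsc handle and the storage-info accessors are not a stable public Python API): cache the data, materialize it, and read the in-memory size from the block manager (also visible in the Storage tab of the Spark UI).

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(100000)).cache()
    rdd.count()                              # materialize so it is actually stored

    # Size as tracked by the block manager for each cached RDD.
    for info in sc._jsc.sc().getRDDStorageInfo():
        print(info.id(), info.memSize(), info.diskSize())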
49
votes
0 answers

Difference between DataSet API and DataFrame API

Can anyone help me understand the difference between the DataSet API and the DataFrame API with an example? Why was there a need to introduce the DataSet API in Spark?
Shashi
  • 2,686
  • 7
  • 35
  • 67
48
votes
10 answers

What is RDD in spark

The definition says: RDD is an immutable distributed collection of objects. I don't quite understand what that means. Is it like data (partitioned objects) stored on a hard disk? If so, then how come RDDs can have user-defined classes (such as java,…
kittu
  • 6,662
  • 21
  • 91
  • 185
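A tiny sketch of what that definition means in practice (data invented): elements are ordinary objects held in partitions across the executors, nothing is computed or stored until an action runs, and transformations never modify an RDD, they derive a new one.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # An RDD is a partitioned, immutable collection of ordinary objects,
    # materialized in executor memory only when an action runs.
    rdd = sc.parallelize([("alice", 1), ("bob", 2)], 2)   # 2 partitions
    upper = rdd.map(lambda kv: (kv[0].upper(), kv[1]))    # new RDD; 'rdd' itself never changes
    print(upper.collect())                                # the action triggers the computation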
47
votes
4 answers

How to read from hbase using spark

The code below reads from HBase, converts it to a JSON structure and then converts that to a schemaRDD. The problem is that I am using a List to store the JSON strings and then passing it to javaRDD; for data of about 100 GB the master will be loaded with…
madan ram
  • 1,260
  • 2
  • 19
  • 26
43
votes
3 answers

Spark union of multiple RDDs

In my Pig code I do this: all_combined = Union relation1, relation2, relation3, relation4, relation5, relation6. I want to do the same with Spark. However, unfortunately, I see that I have to keep doing it pairwise: first =…
user3803714
  • 5,269
  • 10
  • 42
  • 61
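A small sketch of the usual workaround (relation contents invented): SparkContext.union accepts a whole list, so there is no need to chain pairwise unions.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    relations = [sc.parallelize(range(i, i + 3)) for i in range(0, 18, 3)]   # six RDDs

    # One call over the whole list; functools.reduce over rdd.union would also work.
    all_combined = sc.union(relations)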
41
votes
4 answers

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD. If you're familiar with SAS, something like this: data work.split1, work.split2; …
Carlos Bribiescas
  • 4,197
  • 9
  • 35
  • 66
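A minimal sketch of the two standard options (predicate and split weights invented): there is no single transformation that returns several RDDs, so either filter once per output or use randomSplit for a random split.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize(range(10)).cache()      # cache so each pass reuses the same data

    evens = rdd.filter(lambda x: x % 2 == 0)     # one filter per desired output RDD
    odds = rdd.filter(lambda x: x % 2 != 0)

    part_a, part_b = rdd.randomSplit([0.5, 0.5], seed=42)   # random, weight-based split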
40
votes
5 answers

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession? Is there any method to convert or create a Context using a SparkSession? Can I completely replace all the Contexts using one single entry SparkSession? Are…
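A short sketch of how these relate in modern Spark (app name invented): SparkSession is the single entry point, and the older contexts hang off it rather than being constructed separately.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    sc = spark.sparkContext       # the underlying SparkContext (JavaSparkContext is its Java wrapper)
    df = spark.range(5)           # DataFrame functionality that used to live on SQLContext
    spark.sql("SELECT 1").show()  # SQLContext-style SQL is also available on the session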
39
votes
1 answer

Spark RDD - Mapping with extra arguments

Is it possible to pass extra arguments to the mapping function in pySpark? Specifically, I have the following code recipe: raw_data_rdd = sc.textFile("data.json", use_unicode=True) json_data_rdd = raw_data_rdd.map(lambda line:…
Stan
  • 1,042
  • 2
  • 13
  • 29
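A minimal sketch of the usual pattern (function and argument names invented): close over the extra argument with a lambda, or pre-bind it with functools.partial.

    from functools import partial
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def parse_line(line, separator):          # 'separator' is the extra argument
        return line.split(separator)

    raw = sc.parallelize(["a,b", "c,d"])
    via_lambda = raw.map(lambda line: parse_line(line, ","))     # closure captures the argument
    via_partial = raw.map(partial(parse_line, separator=","))    # same thing, pre-bound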