Questions tagged [rdd]

Resilient Distributed Datasets (a.k.a. RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the core of Spark (often referred to as "Spark Core").

Warning: Please note that the RDD API is a very low-level construct, and its use is not recommended in modern versions of Apache Spark. Please use the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute missing or damaged partitions caused by node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive or composite values, e.g. tuples or other objects (that represent the records of the data you work with).
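
The contrast flagged in the warning above, as a minimal sketch (the SparkSession setup and the data are illustrative, not prescribed by the tag wiki): the same per-key aggregation written first against the RDD API and then against the recommended DataFrame API.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; app name, master and data are placeholders.
    val spark = SparkSession.builder().appName("rdd-vs-dataframe").master("local[*]").getOrCreate()
    import spark.implicits._

    // Low-level RDD API: an immutable, partitioned collection transformed through bulk operations.
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    rdd.reduceByKey(_ + _).collect().foreach(println)   // (a,4) (b,2)

    // The same computation with the recommended DataFrame API.
    val df = rdd.toDF("key", "value")
    df.groupBy("key").sum("value").show()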

For more information:

  1. Mastering Apache Spark: RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
1 vote, 0 answers

Saving JSON type spark RDD to Cassandra table

I want to store spark RDD to Cassandra table but it's not working. My RDD is in the form {"id":"04bBGJpwUh","date":"2018-03-26 05:28:25","temp":37,"press":16} {"id":"pi4Axn3iOd","date":"2018-03-26 05:28:27","temp":49,"press":17} My cassandra table…
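
A hedged sketch of one way to do this (not taken from the question): let Spark infer a schema from the JSON strings and write through the DataFrame support of the spark-cassandra-connector. The keyspace sensor_ks and table readings are hypothetical stand-ins for the asker's table, and the connector package is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    // Sketch only: connection host, keyspace and table names are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()
    import spark.implicits._

    val jsonRdd = spark.sparkContext.parallelize(Seq(
      """{"id":"04bBGJpwUh","date":"2018-03-26 05:28:25","temp":37,"press":16}"""))

    // Infer the schema from the JSON strings, then append to Cassandra via the connector.
    spark.read.json(jsonRdd.toDS())
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sensor_ks", "table" -> "readings"))
      .mode("append")
      .save()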
1 vote, 0 answers

Convert Key/Pair RDD to yield sum of values, min and max values in each group using Python Spark

I am new to Spark, I have the below RDD (2, 2.0) (2, 4.0) (2, 1.5) (2, 6.0) (2, 7.0) (2, 8.0) I tried to convert it to (2, 28.5, 1.5, 8) where 2 is the key value, followed by 28.5 as the sum of all values, 1.5 as the minimum value and 8 as the maximum.…
Ursus
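
One way to get the sum, minimum and maximum per key in a single pass is aggregateByKey; a minimal sketch, shown in Scala for consistency with the rest of this page (PySpark exposes the same aggregateByKey method), assuming the spark-shell's SparkContext sc.

    // The accumulator carries (sum, min, max) for each key.
    val pairs = sc.parallelize(Seq((2, 2.0), (2, 4.0), (2, 1.5), (2, 6.0), (2, 7.0), (2, 8.0)))

    val stats = pairs.aggregateByKey((0.0, Double.MaxValue, Double.MinValue))(
      (acc, v) => (acc._1 + v, math.min(acc._2, v), math.max(acc._3, v)),
      (a, b)   => (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3)))

    stats.collect().foreach(println)   // (2,(28.5,1.5,8.0))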
1 vote, 1 answer

groupByKey of RDD not getting passed through

Have a query regarding groupByKey on my RDD. Below is the query I'm trying: rdd3.map{ case(HandleMaxTuple(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21,…
knowone
1 vote, 1 answer

number of tuples limit in RDD; reading RDD throws ArrayIndexOutOfBoundsException

I tried converting a DF to an RDD for a table containing 25 columns. Thereafter I came to know that Scala (until 2.11.8) has a limitation of a maximum of 22 elements in a tuple. val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/") rdd:…
knowone
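
The 22-element tuple limit can be side-stepped by not using a tuple at all: keep each record as an Array[String] (or, on Scala 2.11+, a case class, which may have more than 22 fields). A minimal sketch, assuming the spark-shell's SparkContext sc and Hive's default Ctrl-A field delimiter, which is an assumption to adjust for the actual table layout:

    import org.apache.spark.rdd.RDD

    // "\u0001" (Hive's default separator) is assumed; split(..., -1) keeps trailing empty columns.
    val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/")
    val rows: RDD[Array[String]] = rdd.map(_.split("\u0001", -1))

    // Columns are addressed by index instead of by tuple position, so 25 columns are no problem.
    rows.map(r => (r(0), r(1))).take(5).foreach(println)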
1 vote, 1 answer

Scala - How to read a csv table into an RDD[Vector]

I would like to read from a huge csv file, assign every row to a vector via splitting values by ",". In the end I aim to have an RDD of Vectors which holds the values. However I get an error after Seq: type mismatch; found : Unit required:…
Tolga
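
A minimal sketch of one way to build an RDD[Vector] from a CSV, assuming the spark-shell's SparkContext sc, a purely numeric file with no header row, and a placeholder path data.csv:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Split each line on "," and turn the numeric fields into a dense MLlib vector.
    val vectors: RDD[Vector] = sc.textFile("data.csv")
      .map(_.split(","))
      .map(fields => Vectors.dense(fields.map(_.trim.toDouble)))

    vectors.take(3).foreach(println)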
1 vote, 0 answers

Working on Local Partitions in Spark

I have a huge file stored in S3 and am loading it into my Spark cluster, and I want to invoke a custom Java library which takes an input file location, processes the data and writes to a given output location. However, I cannot rewrite that custom logic in…
Sateesh K
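
One commonly used pattern for this kind of problem is mapPartitions: spill each partition to a local file on the executor, run the unmodified library on it, and read its output back. A hedged sketch, where LegacyProcessor.process(inPath, outPath) is a hypothetical stand-in for the custom Java library (assumed to be on every executor's classpath) and lines is assumed to be the RDD[String] loaded from the S3 file:

    import java.nio.file.Files
    import scala.collection.JavaConverters._

    val processed = lines.mapPartitionsWithIndex { (idx, partition) =>
      val in  = Files.createTempFile(s"part-$idx-in", ".txt")
      val out = Files.createTempFile(s"part-$idx-out", ".txt")
      Files.write(in, partition.toSeq.asJava)             // spill this partition to a local file
      LegacyProcessor.process(in.toString, out.toString)  // hypothetical call into the custom library
      Files.readAllLines(out).asScala.iterator            // feed its output back into the RDD
    }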
1 vote, 2 answers

apache spark - iteratively skip and take from RDD

Given an RDD, what's the best way to sort it and then consume it in discrete sized chunks? For example: JavaRDD baseRdd = sc.parallelize(Arrays.asList(1,2,5,3,4)); JavaRDD sorted = baseRdd.sortBy(x -> x, true, 5); //…
Kyle Fransham
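
One way to consume a sorted RDD in fixed-size chunks without re-sorting is to index it once with zipWithIndex and filter index ranges. A minimal sketch in Scala (the question uses the Java API, which has the same sortBy, zipWithIndex and filter methods), assuming the spark-shell's SparkContext sc:

    // Sort once, index once; each chunk is then a cheap filter over the index.
    val sorted    = sc.parallelize(Seq(1, 2, 5, 3, 4)).sortBy(x => x).zipWithIndex()
    val chunkSize = 2L

    def chunk(n: Long) =
      sorted.filter { case (_, i) => i >= n * chunkSize && i < (n + 1) * chunkSize }
            .map(_._1)

    chunk(0).collect()   // Array(1, 2)
    chunk(1).collect()   // Array(3, 4)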
1 vote, 1 answer

NotSerializableException in Spark

Most of the non-serializable issues online use very basic data as an input for their sc.parallelize(), and they encounter the non-serializable issue in the map section, but mine is a type. I have a specific data type, which is coming from a third…
Arsinux
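
A common way around a NotSerializableException caused by a third-party type is to construct that object inside mapPartitions, so it is created on the executor and never shipped with the closure. A hedged sketch, where ThirdPartyParser is a hypothetical stand-in for the non-serializable library type and records is assumed to be an existing RDD[String]:

    val parsed = records.mapPartitions { partition =>
      val parser = new ThirdPartyParser()        // hypothetical; built once per partition, on the executor
      partition.map(line => parser.parse(line))  // the parser itself never crosses the wire
    }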
1 vote, 2 answers

how to sort the rdd with mixed ascending and descending on multiple fields in Scala

So here is the data. Structure explained: CREATE TABLE products ( product_id int(11) NOT NULL AUTO_INCREMENT, product_category_id int(11) NOT NULL, product_name varchar(45) NOT NULL, product_description varchar(255) NOT NULL, …
Choix
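
One way to mix ascending and descending order in a single sortBy is to negate the numeric field that should be descending. A minimal sketch, assuming the rows have already been parsed into (categoryId: Int, price: Double, name: String) triples in an RDD called products (illustrative names, not the asker's exact schema):

    // Ascending on categoryId, descending on price.
    val sorted = products.sortBy { case (categoryId, price, _) => (categoryId, -price) }
    sorted.take(10).foreach(println)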
1 vote, 1 answer

Convert rdd rows into one column

I am trying to get all the values from Rows into Columns. I don't have an index, so I find it hard to have them all in one column. Code: getting the values traceFilters = sqlContext.read.format("csv").options(header='true', delimiter =…
user5813190
1 vote, 1 answer

how to extract values from an array of string arrays in an RDD

val rdd: Array[Array[String]] = Array(Array("2345","345","fghj","dfhg"), Array("2345","3450","fghj","dfhg"), Array("23145","1345","fghj","dffghg") …
premon
1 vote, 2 answers

Getting ArrayIndexOutOfBoundsException while splitting record from a file in Scala

My file contains records like this: 11001^1^100^2015-06-05 22:35:21.543^
Kumar Harsh
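
The usual cause here is that the record ends with the delimiter, so a plain split("\\^") silently drops the trailing empty field and a later column access overflows. Passing a negative limit keeps trailing empty strings; a minimal sketch on the sample record:

    val line   = "11001^1^100^2015-06-05 22:35:21.543^"
    val fields = line.split("\\^", -1)   // the -1 limit keeps the trailing empty column
    println(fields.length)               // 5, so fields(4) is "" instead of an ArrayIndexOutOfBoundsException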
1 vote, 1 answer

How do I split a Spark rdd Array[(String, Array[String])] to a single RDD

I want to split the following RDD into a single RDD (id, (all names of the same type)). >val test = rddByKey.map{case(k,v)=> (k,v.collect())} test: Array[(String, Array[String])] = Array( (45000,Array(Amit, Pavan, Ratan)), (10000,Array(Kumar,…
Biswajit
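
Rather than calling collect() on nested values, the data can stay in a single pair RDD and the names can be concatenated per key. A minimal sketch, assuming the spark-shell's SparkContext sc and the sample values from the question:

    val pairs = sc.parallelize(Seq(
      ("45000", "Amit"), ("45000", "Pavan"), ("45000", "Ratan"),
      ("10000", "Kumar")))

    // Group the names under each id and join them into one string per key.
    val byId = pairs.groupByKey().mapValues(_.mkString(","))
    byId.collect().foreach(println)   // (45000,Amit,Pavan,Ratan) (10000,Kumar)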
1 vote, 0 answers

Get the fields and values from a Spark RDD

I have a Spark RDD (org.apache.spark.rdd) made from a JSON string that looks like this: {"id":1 , "text":"sample1"} I am using spray-json in my application and I need to extract the keys into a JsArray (keys_jsArray) - (contains id, text). Also…
Nagireddy Hanisha
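
A minimal sketch of the spray-json side, applied to a single record; inside Spark the same function can be mapped over the RDD of JSON strings. The names keys_jsArray and values_jsArray follow the question's wording:

    import spray.json._

    val parsed = """{"id":1 , "text":"sample1"}""".parseJson.asJsObject

    val keys_jsArray   = JsArray(parsed.fields.keys.map(JsString(_)).toVector)  // JsArray of the field names
    val values_jsArray = JsArray(parsed.fields.values.toVector)                 // JsArray of the field values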
1 vote, 0 answers

How to handle and convert a null datetime field into a Unix timestamp in Scala

I have a code snippet below that is not accepted by Scala; it would be appreciated if someone could help fix it, thanks. train_no_header is an RDD generated from a CSV file; its first line is shown below: scala> train_no_header.first res4:…
Choix