Questions tagged [rdd]

Resilient Distributed Datasets (a.k.a. RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

RDDs provide a highly restricted form of shared memory: they are read-only datasets that can only be constructed through bulk operations on other RDDs.

RDD is the primary data abstraction in Apache Spark and the core of Spark (often referred to as "Spark Core").

Warning: Please note that the RDD API is a very low-level construct, and its use is not recommended in modern versions of Apache Spark. Please use the DataFrame/Dataset API instead.

The features of RDDs (decomposing the name):

Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute missing or damaged partitions caused by node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive or composite values, e.g. tuples or other objects (that represent the records of the data you work with).
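
The contrast flagged in the warning above, as a minimal sketch (the SparkSession setup and the data are illustrative, not prescribed by the tag wiki): the same per-key aggregation written first against the RDD API and then against the recommended DataFrame API.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; app name, master and data are placeholders.
    val spark = SparkSession.builder().appName("rdd-vs-dataframe").master("local[*]").getOrCreate()
    import spark.implicits._

    // Low-level RDD API: an immutable, partitioned collection transformed through bulk operations.
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    rdd.reduceByKey(_ + _).collect().foreach(println)   // (a,4) (b,2)

    // The same computation with the recommended DataFrame API.
    val df = rdd.toDF("key", "value")
    df.groupBy("key").sum("value").show()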

For more information:

  1. Mastering Apache Spark: RDD tutorial

  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

4052 questions
1 vote, 0 answers

Saving JSON type spark RDD to Cassandra table

I want to store spark RDD to Cassandra table but it's not working. My RDD is in the form {"id":"04bBGJpwUh","date":"2018-03-26 05:28:25","temp":37,"press":16} {"id":"pi4Axn3iOd","date":"2018-03-26 05:28:27","temp":49,"press":17} My cassandra table…
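
A hedged sketch of one way to do this (not taken from the question): let Spark infer a schema from the JSON strings and write through the DataFrame support of the spark-cassandra-connector. The keyspace sensor_ks and table readings are hypothetical stand-ins for the asker's table, and the connector package is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    // Sketch only: connection host, keyspace and table names are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()
    import spark.implicits._

    val jsonRdd = spark.sparkContext.parallelize(Seq(
      """{"id":"04bBGJpwUh","date":"2018-03-26 05:28:25","temp":37,"press":16}"""))

    // Infer the schema from the JSON strings, then append to Cassandra via the connector.
    spark.read.json(jsonRdd.toDS())
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sensor_ks", "table" -> "readings"))
      .mode("append")
      .save()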
1 vote, 0 answers

Convert Key/Pair RDD to yield sum of values, min and max values in each group using Python Spark

I am new to Spark, I have the below RDD (2, 2.0) (2, 4.0) (2, 1.5) (2, 6.0) (2, 7.0) (2, 8.0) I tried to convert it to (2, 28.5, 1.5, 8) where 2 is the key value, followed by 28.5 as the sum of all values, 1.5 as the minimum value and 8 as the maximum.…
Ursus
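
One way to get the sum, minimum and maximum per key in a single pass is aggregateByKey; a minimal sketch, shown in Scala for consistency with the rest of this page (PySpark exposes the same aggregateByKey method), assuming the spark-shell's SparkContext sc.

    // The accumulator carries (sum, min, max) for each key.
    val pairs = sc.parallelize(Seq((2, 2.0), (2, 4.0), (2, 1.5), (2, 6.0), (2, 7.0), (2, 8.0)))

    val stats = pairs.aggregateByKey((0.0, Double.MaxValue, Double.MinValue))(
      (acc, v) => (acc._1 + v, math.min(acc._2, v), math.max(acc._3, v)),
      (a, b)   => (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3)))

    stats.collect().foreach(println)   // (2,(28.5,1.5,8.0))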
1 vote, 1 answer

groupByKey of RDD not getting passed through

Have a query regarding groupByKey on my RDD. Below is the query I'm trying: rdd3.map{ case(HandleMaxTuple(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, col15, col16, col17, col18, col19, col20, col21,…
knowone
1 vote, 1 answer

number of tuples limit in RDD; reading RDD throws ArrayIndexOutOfBoundsException

I tried converting a DF to an RDD for a table containing 25 columns. Thereafter I came to know that Scala (until 2.11.8) has a limitation of a maximum of 22 elements in a tuple. val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/") rdd:…
knowone
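
The 22-element tuple limit can be side-stepped by not using a tuple at all: keep each record as an Array[String] (or, on Scala 2.11+, a case class, which may have more than 22 fields). A minimal sketch, assuming the spark-shell's SparkContext sc and Hive's default Ctrl-A field delimiter, which is an assumption to adjust for the actual table layout:

    import org.apache.spark.rdd.RDD

    // "\u0001" (Hive's default separator) is assumed; split(..., -1) keeps trailing empty columns.
    val rdd = sc.textFile("/user/hive/warehouse/myDB.db/myTable/")
    val rows: RDD[Array[String]] = rdd.map(_.split("\u0001", -1))

    // Columns are addressed by index instead of by tuple position, so 25 columns are no problem.
    rows.map(r => (r(0), r(1))).take(5).foreach(println)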
1 vote, 1 answer

Scala - How to read a csv table into an RDD[Vector]

I would like to read from a huge csv file, assign every row to a vector via splitting values by ",". In the end I aim to have an RDD of Vectors which holds the values. However I get an error after Seq: type mismatch; found : Unit required:…
Tolga
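
A minimal sketch of one way to build an RDD[Vector] from a CSV, assuming the spark-shell's SparkContext sc, a purely numeric file with no header row, and a placeholder path data.csv:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Split each line on "," and turn the numeric fields into a dense MLlib vector.
    val vectors: RDD[Vector] = sc.textFile("data.csv")
      .map(_.split(","))
      .map(fields => Vectors.dense(fields.map(_.trim.toDouble)))

    vectors.take(3).foreach(println)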
1 vote, 0 answers

Working on Local Partitions in Spark

I have a huge file stored in S3 and am loading it into my Spark cluster, and I want to invoke a custom Java library which takes an input file location, processes the data and writes to a given output location. However, I cannot rewrite that custom logic in…
Sateesh K
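
One commonly used pattern for this kind of problem is mapPartitions: spill each partition to a local file on the executor, run the unmodified library on it, and read its output back. A hedged sketch, where LegacyProcessor.process(inPath, outPath) is a hypothetical stand-in for the custom Java library (assumed to be on every executor's classpath) and lines is assumed to be the RDD[String] loaded from the S3 file:

    import java.nio.file.Files
    import scala.collection.JavaConverters._

    val processed = lines.mapPartitionsWithIndex { (idx, partition) =>
      val in  = Files.createTempFile(s"part-$idx-in", ".txt")
      val out = Files.createTempFile(s"part-$idx-out", ".txt")
      Files.write(in, partition.toSeq.asJava)             // spill this partition to a local file
      LegacyProcessor.process(in.toString, out.toString)  // hypothetical call into the custom library
      Files.readAllLines(out).asScala.iterator            // feed its output back into the RDD
    }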
1 vote, 2 answers

apache spark - iteratively skip and take from RDD

Given an RDD, what's the best way to sort it and then consume it in discrete sized chunks? For example: JavaRDD baseRdd = sc.parallelize(Arrays.asList(1,2,5,3,4)); JavaRDD sorted = baseRdd.sortBy(x -> x, true, 5); //…
Kyle Fransham
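
One way to consume a sorted RDD in fixed-size chunks without re-sorting is to index it once with zipWithIndex and filter index ranges. A minimal sketch in Scala (the question uses the Java API, which has the same sortBy, zipWithIndex and filter methods), assuming the spark-shell's SparkContext sc:

    // Sort once, index once; each chunk is then a cheap filter over the index.
    val sorted    = sc.parallelize(Seq(1, 2, 5, 3, 4)).sortBy(x => x).zipWithIndex()
    val chunkSize = 2L

    def chunk(n: Long) =
      sorted.filter { case (_, i) => i >= n * chunkSize && i < (n + 1) * chunkSize }
            .map(_._1)

    chunk(0).collect()   // Array(1, 2)
    chunk(1).collect()   // Array(3, 4)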
1 vote, 1 answer

NotSerializableException in Spark

Most of the non-serializable issues online use very basic data as an input for their sc.parallelize(), and they encounter the non-serializable issue in the map section, but mine is a type. I have a specific data type, which is coming from a third…
Arsinux
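
A common way around a NotSerializableException caused by a third-party type is to construct that object inside mapPartitions, so it is created on the executor and never shipped with the closure. A hedged sketch, where ThirdPartyParser is a hypothetical stand-in for the non-serializable library type and records is assumed to be an existing RDD[String]:

    val parsed = records.mapPartitions { partition =>
      val parser = new ThirdPartyParser()        // hypothetical; built once per partition, on the executor
      partition.map(line => parser.parse(line))  // the parser itself never crosses the wire
    }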
1 vote, 2 answers

how to sort the rdd with mixed ascending and descending on multiple fields in Scala

So here is the data. Structure explained: CREATE TABLE products ( product_id int(11) NOT NULL AUTO_INCREMENT, product_category_id int(11) NOT NULL, product_name varchar(45) NOT NULL, product_description varchar(255) NOT NULL, …
Choix
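
One way to mix ascending and descending order in a single sortBy is to negate the numeric field that should be descending. A minimal sketch, assuming the rows have already been parsed into (categoryId: Int, price: Double, name: String) triples in an RDD called products (illustrative names, not the asker's exact schema):

    // Ascending on categoryId, descending on price.
    val sorted = products.sortBy { case (categoryId, price, _) => (categoryId, -price) }
    sorted.take(10).foreach(println)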
1 vote, 1 answer

Convert rdd rows into one column

I am trying to get all the values from Rows into Columns. I don't have an index, so I find it hard to have them all in one column. Code: getting the values traceFilters = sqlContext.read.format("csv").options(header='true', delimiter =…
user5813190
1 vote, 1 answer

how to extract values from an array of string arrays in an RDD

val rdd: Array[Array[String]] = Array(Array("2345","345","fghj","dfhg"), Array("2345","3450","fghj","dfhg"), Array("23145","1345","fghj","dffghg") …
premon
1 vote, 2 answers

Getting ArrayIndexOutOfBoundsException while splitting record from a file in Scala

My file contains records like this: 11001^1^100^2015-06-05 22:35:21.543^
Kumar Harsh
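
The usual cause here is that the record ends with the delimiter, so a plain split("\\^") silently drops the trailing empty field and a later column access overflows. Passing a negative limit keeps trailing empty strings; a minimal sketch on the sample record:

    val line   = "11001^1^100^2015-06-05 22:35:21.543^"
    val fields = line.split("\\^", -1)   // the -1 limit keeps the trailing empty column
    println(fields.length)               // 5, so fields(4) is "" instead of an ArrayIndexOutOfBoundsException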
1 vote, 1 answer

How do I split a Spark rdd Array[(String, Array[String])] to a single RDD

I want to split the following RDD into a single RDD (id, (all names of the same type)). >val test = rddByKey.map{case(k,v)=> (k,v.collect())} test: Array[(String, Array[String])] = Array( (45000,Array(Amit, Pavan, Ratan)), (10000,Array(Kumar,…
Biswajit
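
Rather than calling collect() on nested values, the data can stay in a single pair RDD and the names can be concatenated per key. A minimal sketch, assuming the spark-shell's SparkContext sc and the sample values from the question:

    val pairs = sc.parallelize(Seq(
      ("45000", "Amit"), ("45000", "Pavan"), ("45000", "Ratan"),
      ("10000", "Kumar")))

    // Group the names under each id and join them into one string per key.
    val byId = pairs.groupByKey().mapValues(_.mkString(","))
    byId.collect().foreach(println)   // (45000,Amit,Pavan,Ratan) (10000,Kumar)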
1 vote, 0 answers

Get the fields and values from a Spark RDD

I have a Spark RDD (org.apache.spark.rdd) made from a JSON string that looks like this: {"id":1 , "text":"sample1"} I am using spray-json in my application and I need to extract the keys into a JsArray (keys_jsArray) - (contains id, text). Also…
Nagireddy Hanisha
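
A minimal sketch of the spray-json side, applied to a single record; inside Spark the same function can be mapped over the RDD of JSON strings. The names keys_jsArray and values_jsArray follow the question's wording:

    import spray.json._

    val parsed = """{"id":1 , "text":"sample1"}""".parseJson.asJsObject

    val keys_jsArray   = JsArray(parsed.fields.keys.map(JsString(_)).toVector)  // JsArray of the field names
    val values_jsArray = JsArray(parsed.fields.values.toVector)                 // JsArray of the field values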
1 vote, 0 answers

How to handle and convert a null datetime field into a Unix timestamp in Scala

I have a code snippet below that is not accepted by Scala; it would be appreciated if someone could help fix it, thanks. train_no_header is an RDD generated from a CSV file; its first line is shown below: scala> train_no_header.first res4:…
Choix