
I have an RDD like so:

(aid, session, sessionnew, date)
(55-BHA, 58, 15, 2017-05-09)
(07-YET, 18, 5, 2017-05-09)
(32-KXD, 27, 20, 2017-05-09)
(19-OJD, 10, 1, 2017-05-09)
(55-BHA, 1, 0, 2017-05-09)
(55-BHA, 19, 3, 2017-05-09)
(32-KXD, 787, 345, 2017-05-09)
(07-YET, 4578, 1947, 2017-05-09)
(07-YET, 23, 5, 2017-05-09)
(32-KXD, 85, 11, 2017-05-09)

I want to split everything with the same aid into a new RDD and then cache it for later use, so one RDD per unique aid. I saw some other answers, but they save the RDDs to files. Is there a problem with keeping this many RDDs in memory? It will likely be around 30k+.

I serve the cached RDDs with Spark Jobserver.

ozzieisaacs

1 Answer


Rather than caching 30k+ separate RDDs, I would suggest caching a single grouped RDD as below.
Let's say you have RDD data as:

val rddData = sparkContext.parallelize(Seq(
      ("55-BHA", 58, 15, "2017-05-09"),
      ("07-YET", 18, 5, "2017-05-09"),
      ("32-KXD", 27, 20, "2017-05-09"),
      ("19-OJD", 10, 1, "2017-05-09"),
      ("55-BHA", 1, 0, "2017-05-09"),
      ("55-BHA", 19, 3, "2017-05-09"),
      ("32-KXD", 787, 345, "2017-05-09"),
      ("07-YET", 4578, 1947, "2017-05-09"),
      ("07-YET", 23, 5, "2017-05-09"),
      ("32-KXD", 85, 11, "2017-05-09")))

You can cache the data by grouping on "aid", and then use filter to select the group you need:

val grouped = rddData.groupBy(_._1).cache
val filtered = grouped.filter(_._1 == "32-KXD")
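The groupBy/filter logic above can be illustrated with plain Scala collections (no Spark needed); the object name `GroupByAidSketch` is just for this illustration, and the data is a subset of the question's tuples:

```scala
// Plain-Scala analogue of the Spark groupBy/filter above.
// groupBy(_._1) buckets rows by aid into a Map[aid, Seq[row]],
// and filtering on the key mirrors grouped.filter(_._1 == "32-KXD").
object GroupByAidSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq(
      ("55-BHA", 58, 15, "2017-05-09"),
      ("32-KXD", 27, 20, "2017-05-09"),
      ("32-KXD", 787, 345, "2017-05-09"))

    val grouped = data.groupBy(_._1)            // Map[aid, Seq[row]]
    val filtered = grouped.filter(_._1 == "32-KXD")
    // Sum the session column for the selected aid: 27 + 787
    println(filtered("32-KXD").map(_._2).sum)   // prints 814
  }
}
```

The key point is that there is only one grouped structure to keep in memory, not one per aid.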

But I would suggest using a DataFrame as below, which is more efficient and better optimized than RDDs:

import sqlContext.implicits._
val dataFrame = Seq(
  ("55-BHA", 58, 15, "2017-05-09"),
  ("07-YET", 18, 5, "2017-05-09"),
  ("32-KXD", 27, 20, "2017-05-09"),
  ("19-OJD", 10, 1, "2017-05-09"),
  ("55-BHA", 1, 0, "2017-05-09"),
  ("55-BHA", 19, 3, "2017-05-09"),
  ("32-KXD", 787, 345, "2017-05-09"),
  ("07-YET", 4578, 1947, "2017-05-09"),
  ("07-YET", 23, 5, "2017-05-09"),
  ("32-KXD", 85, 11, "2017-05-09")).toDF("aid", "session", "sessionnew", "date").cache

val newDF = dataFrame.select("*").where(dataFrame("aid") === "32-KXD")
newDF.show
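The select/where step can also be sketched with plain Scala case classes, which runs without a SparkSession; the `Record` case class and object name here are hypothetical, chosen only to mirror the DataFrame schema above:

```scala
// Case class mirroring the DataFrame columns (aid, session, sessionnew, date).
case class Record(aid: String, session: Int, sessionnew: Int, date: String)

object WhereSketch {
  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Record("55-BHA", 58, 15, "2017-05-09"),
      Record("32-KXD", 27, 20, "2017-05-09"),
      Record("19-OJD", 10, 1, "2017-05-09"),
      Record("32-KXD", 85, 11, "2017-05-09"))

    // Equivalent of dataFrame.where(dataFrame("aid") === "32-KXD")
    val filtered = rows.filter(_.aid == "32-KXD")
    filtered.foreach(println)   // the two 32-KXD records
  }
}
```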

I hope this helps.

Ramesh Maharjan