I have a big data set (let's say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors on 10 nodes to do that. So every executor has its own JVM, right? Every time, each one loads the whole reference dataset, which takes a long time and is inefficient. Is there a good approach to handle this? Currently I am storing the data on AWS S3 and running everything with EMR. Maybe it would be better to use more elegant storage that I could query on the fly, or to spin up, for example, Redis as part of my cluster, push the data there, and then query it?
UPD1:
- Flat data is gzipped CSV files on S3, partitioned into 128 MB files.
- It is read into a Dataset (the coalesce is there to reduce the number of partitions so the data is spread across fewer nodes):
val df = sparkSession.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)
  .load(path)
  .coalesce(3)
  .as[SegmentConflationRef]
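As a side note, on Spark 2.x and later the CSV source is built in, so the same read can be expressed without the Databricks package. A sketch, assuming the same `schema`, `path`, and `SegmentConflationRef` as above:

```scala
// Built-in CSV reader (Spark 2.x+), equivalent to com.databricks.spark.csv.
// Note: gzip files are not splittable, so each .csv.gz file maps to exactly
// one input partition regardless of its size.
val ds = sparkSession.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)
  .csv(path)
  .coalesce(3)
  .as[SegmentConflationRef]
```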
- Then I need to convert the flat data into an ordered, grouped list and put it into some key-value storage, an in-memory map in this case:
// Pull the whole reference Dataset down to the driver
val data: Seq[SegmentConflationRef] = ds.collect()
val map = mutable.Map[String, Seq[SegmentConflationRef]]()
// foreach rather than map: we only want the side effect of filling the map
data.groupBy(_.source_segment_id).foreach { case (id, refs) =>
  map += (id -> refs.sortBy(_.source_start_offset_m))
}
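The same grouped, sorted map can be built without mutation, as an immutable structure in one pass over the collected data. A sketch using only standard Scala collections, assuming the same `ds` as above:

```scala
// collect() returns Array[SegmentConflationRef]; groupBy yields
// Map[String, Array[SegmentConflationRef]], which we sort per group.
val refMap: Map[String, Seq[SegmentConflationRef]] =
  ds.collect()
    .groupBy(_.source_segment_id)
    .map { case (id, refs) => id -> refs.sortBy(_.source_start_offset_m).toSeq }
```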
- After that I am going to do lookups from another Dataset.
So in that case I want the reference map to be copied to every executor. One problem is how to broadcast such a big map across nodes, or what would be a better approach? Maybe not using Spark from the beginning and instead loading the data locally from HDFS in every executor?
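For reference, the standard way to ship a lookup structure once per executor (rather than once per task) is a Spark broadcast variable; whether it fits here depends on the map fitting comfortably in each executor's memory. A sketch, where `map` is the grouped map built in UPD1, and `otherDs` with its `segment_id` field are placeholders for the large Dataset:

```scala
// Broadcast the reference map; each executor deserializes it at most once.
// map.toMap converts the mutable map to an immutable, serializable one.
val refBroadcast = sparkSession.sparkContext.broadcast(map.toMap)

// Placeholder lookup over the big Dataset, via the RDD API to keep the
// sketch free of Encoder concerns.
val matched = otherDs.rdd.mapPartitions { rows =>
  val ref = refBroadcast.value
  rows.flatMap(row => ref.get(row.segment_id).map(refs => (row, refs)))
}
```

An alternative that avoids collecting the reference data to the driver at all is a broadcast hash join, e.g. `otherDs.join(broadcast(refDs), "source_segment_id")` with `org.apache.spark.sql.functions.broadcast`; Spark then ships the small side to every executor itself.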