I have a big data set (let's say 4 GB) that is used as a reference source to process another big data set (100-200 GB). I have a cluster of 30 executors on 10 nodes to do that. So every executor has its own JVM, right? Every time, each one loads the whole reference dataset, which takes a long time and is inefficient. Is there a good approach to handle this? Currently I am storing the data on AWS S3 and running everything with EMR. Maybe it would be better to use more elegant storage that I could query on the fly, or to spin up, for example, Redis as part of my cluster, push the data there, and then query it?
UPD1:
- Flat data is gzipped CSV files on S3, partitioned into 128 MB files.
- It is read into a Dataset (the coalesce is there to reduce the number of partitions so the data is spread across fewer nodes):
val df = sparkSession.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)
  .load(path)
  .coalesce(3)
  .as[SegmentConflationRef]
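As a side note, on Spark 2.x and later the CSV source is built in, so the same read can be expressed without the Databricks package. A sketch, assuming the same `schema`, `path`, and `SegmentConflationRef` as above:

```scala
// Built-in CSV reader (Spark 2.x+), equivalent to com.databricks.spark.csv.
// Note: gzip files are not splittable, so each .csv.gz file maps to exactly
// one input partition regardless of its size.
val ds = sparkSession.read
  .option("header", "false")
  .option("delimiter", ",")
  .schema(schema)
  .csv(path)
  .coalesce(3)
  .as[SegmentConflationRef]
```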
- Then I need to convert the flat data into an ordered, grouped list and put it into some key-value storage, an in-memory map in this case:
// Pull the whole reference Dataset down to the driver
val data: Seq[SegmentConflationRef] = ds.collect()
val map = mutable.Map[String, Seq[SegmentConflationRef]]()
// foreach rather than map: we only want the side effect of filling the map
data.groupBy(_.source_segment_id).foreach { case (id, refs) =>
  map += (id -> refs.sortBy(_.source_start_offset_m))
}
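The same grouped, sorted map can be built without mutation, as an immutable structure in one pass over the collected data. A sketch using only standard Scala collections, assuming the same `ds` as above:

```scala
// collect() returns Array[SegmentConflationRef]; groupBy yields
// Map[String, Array[SegmentConflationRef]], which we sort per group.
val refMap: Map[String, Seq[SegmentConflationRef]] =
  ds.collect()
    .groupBy(_.source_segment_id)
    .map { case (id, refs) => id -> refs.sortBy(_.source_start_offset_m).toSeq }
```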
- After that I am going to do lookups from another Dataset.
So in that case I want the reference map to be copied to every executor. One problem is how to broadcast such a big map across nodes, or what would be a better approach? Maybe not using Spark from the beginning and instead loading the data locally from HDFS in every executor?
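For reference, the standard way to ship a lookup structure once per executor (rather than once per task) is a Spark broadcast variable; whether it fits here depends on the map fitting comfortably in each executor's memory. A sketch, where `map` is the grouped map built in UPD1, and `otherDs` with its `segment_id` field are placeholders for the large Dataset:

```scala
// Broadcast the reference map; each executor deserializes it at most once.
// map.toMap converts the mutable map to an immutable, serializable one.
val refBroadcast = sparkSession.sparkContext.broadcast(map.toMap)

// Placeholder lookup over the big Dataset, via the RDD API to keep the
// sketch free of Encoder concerns.
val matched = otherDs.rdd.mapPartitions { rows =>
  val ref = refBroadcast.value
  rows.flatMap(row => ref.get(row.segment_id).map(refs => (row, refs)))
}
```

An alternative that avoids collecting the reference data to the driver at all is a broadcast hash join, e.g. `otherDs.join(broadcast(refDs), "source_segment_id")` with `org.apache.spark.sql.functions.broadcast`; Spark then ships the small side to every executor itself.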