
I am trying to build a component which dynamically joins a large dataset to several much smaller datasets. I already have the large and smaller datasets persisted in memory as data frames. As user inputs come in, I need to select a subset of the large dataset and enrich it with some information from the smaller datasets.

Unfortunately, these dynamic joins are proving to be expensive, on the order of minutes rather than seconds. An avenue I would like to explore is shipping replicas of the smaller datasets to all nodes on my cluster such that the join happens simultaneously on each node and the results are simply collected at the end. I am, however, not sure of the best way to do this.

Broadcast variables seem to be the only way to ship data across nodes for computations. However, the Spark documentation doesn't say much about appropriate use cases. Would what I described above be suitable for broadcast variables? Is it acceptable or even possible to use data frames as broadcast variables? Are there any other, better avenues available to me to quickly join data frames like these?

Vishakh
  • _Is it acceptable or even possible to use data frames as broadcast variables_ - see for example http://stackoverflow.com/q/35235450/1560062. There are other ways to handle reference data, including file distribution or accessing external systems, but this is not a good question for SO. – zero323 Feb 10 '16 at 20:59
  • Thank you, that was very helpful. Would you know any resources for finding out more about handling reference data on Spark? – Vishakh Feb 11 '16 at 18:06

1 Answer


Whether this is actually faster depends on the size of your small datasets and how often you want to change them.

In any case, you cannot broadcast a Spark DataFrame itself; instead you need to collect the small dataset and broadcast it as an ordinary variable/data structure. I would also recommend doing the join in mapPartitions rather than an ordinary map and seeing if that speeds things up (see the sketch below).
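
A minimal sketch of that first step, assuming a Spark 1.x-style API with an existing SparkContext `sc` and hypothetical frames `largeDF(key, payload)` and `smallDF(key, extra)` (the names and column layout are illustrative, not from your setup):

```scala
// Hypothetical setup: `smallDF` has columns (key: String, extra: String);
// `sc` is the active SparkContext (e.g., in spark-shell).
// Pull the small dataset onto the driver as a plain Scala Map, then
// broadcast it so each executor receives one read-only copy.
val smallMap: Map[String, String] = smallDF.rdd
  .map(row => (row.getString(0), row.getString(1)))
  .collect()
  .toMap
val smallBc = sc.broadcast(smallMap)
```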

Also note: you will not be able to call Spark's own join inside the worker routine; you will have to either write your own join logic or use the join routine of a library that can handle the in-memory types of the datasets.
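
A hand-rolled lookup join along those lines could look like the following sketch, reusing the hypothetical `smallBc` broadcast and column layout from above:

```scala
// Join inside mapPartitions: the broadcast Map is fetched once per
// partition, and every row of the large frame does an in-memory hash
// lookup instead of going through a shuffle-based DataFrame join.
val joined = largeDF.rdd.mapPartitions { rows =>
  val lookup = smallBc.value  // one deserialization per partition
  rows.flatMap { row =>
    val key = row.getString(0)
    // Inner-join semantics: rows with no match in the small Map are dropped.
    lookup.get(key).map(extra => (key, row.getString(1), extra))
  }
}
```

Using `Option` with `flatMap` gives you inner-join behavior; if you need a left outer join instead, map unmatched keys to a default value rather than dropping them.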

Jonathan