I am trying to build a component that dynamically joins a large dataset to several much smaller datasets. I already have the large and smaller datasets persisted in memory as DataFrames. As user inputs come in, I need to select a subset of the large dataset and enrich it with some information from the smaller datasets.
Unfortunately, these dynamic joins are proving expensive: on the order of minutes rather than seconds. An avenue I would like to explore is shipping replicas of the smaller datasets to every node in my cluster, so that the join happens locally on each node and the results are simply collected at the end. I am, however, not sure of the best way to do this.
Broadcast variables seem to be the only way to ship data across nodes for computations. However, the Spark documentation doesn't say much about appropriate use cases. Would what I described above be suitable for broadcast variables? Is it acceptable or even possible to use data frames as broadcast variables? Are there any other, better avenues available to me to quickly join data frames like these?
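For concreteness, here is a minimal sketch of the kind of join I have in mind, written against Spark's `broadcast()` join hint (which, as I understand it, replicates the smaller DataFrame to every executor rather than going through a broadcast *variable* directly). The names `largeDF`, `smallDF`, the paths, and the key `"id"` are placeholders for my real data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col}

val spark = SparkSession.builder()
  .appName("broadcast-join-sketch")
  .getOrCreate()

// Both datasets are already persisted in memory in my real setup;
// the parquet paths here are hypothetical.
val largeDF = spark.read.parquet("/path/to/large").cache()
val smallDF = spark.read.parquet("/path/to/small").cache()

// broadcast() hints that smallDF should be replicated to all executors,
// so the join can run as a local map-side join with no shuffle of largeDF.
val enriched = largeDF
  .filter(col("id") > 0)              // subset driven by user input
  .join(broadcast(smallDF), "id")
```

I am also aware that Spark can apply this optimization automatically when a table is below `spark.sql.autoBroadcastJoinThreshold`, but I don't know whether relying on that is preferable to hinting explicitly as above.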