
At a certain point in my code I have two differently typed Datasets, and I need data from one to filter the other. Assuming there is no way to change the code before this point, is there any way to do what I'm describing in the comment below without collecting all the data from report2Ds and using it inside the Spark function?

Dataset<Report1> report1Ds = ...;
Dataset<Report2> report2Ds = ...;

report1Ds.map((MapFunction<Report1, Report3>) report -> {
    String company = report.getCompany();
    // get data from report2Ds where report2.getEmployeer().equals(company)
    // and build a Report3 from it
    ...
}, Encoders.kryo(Report3.class));

Any suggestion, or even advice on better designs that avoid cases like this, would be really appreciated.

1 Answer


Without changing your approach, no. This is not possible because within the map block you can't directly use driver-side abstractions (Datasets, DataFrames, or the SparkContext); those live on the driver, while the map function executes on the executors. Please refer to the following links for further information:

Apache Spark : When not to use mapPartition and foreachPartition?

Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset

A different approach would be to identify the linking fields between the two Datasets, join on them (e.g. report1Ds.join(report2Ds, report1Ds.col("company").equalTo(report2Ds.col("employeer"))) in your example), and then apply the filters according to the logic you want.
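For illustration, here is a minimal sketch of that join in Java, assuming the getters from your snippet correspond to bean columns named "company" and "employeer", and that Report3 has a no-arg constructor; the extra filter and the field names read from the joined Row are hypothetical placeholders:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Join on the linking fields instead of reading report2Ds inside map.
Dataset<Row> joined = report1Ds.join(
        report2Ds,
        report1Ds.col("company").equalTo(report2Ds.col("employeer")));

// Apply your filtering logic on the joined result, then map each Row to a Report3.
Dataset<Report3> report3Ds = joined
        .filter(joined.col("someField").isNotNull()) // hypothetical filter
        .map((MapFunction<Row, Report3>) row -> {
            Report3 r3 = new Report3();
            // populate r3 from the joined row, e.g. row.getAs("company")
            return r3;
        }, Encoders.kryo(Report3.class));

This keeps all the work inside Spark's execution plan, so nothing has to be collected to the driver, and Spark is free to pick an efficient join strategy (e.g. a broadcast join if one side is small).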

abiratsis