Context
I want to iterate over a Spark Dataset and update a HashMap for each row.
Here is the code I have:
import java.util.HashMap;
import java.util.Vector;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

// At this point, I have a my_dataset variable containing 300 000 rows and 10 columns
// - my_dataset.count() == 300 000
// - my_dataset.columns().length == 10

// Declare my HashMap
HashMap<String, Vector<String>> my_map = new HashMap<String, Vector<String>>();

// Initialize the map: one empty Vector per column
for(String col : my_dataset.columns())
{
    my_map.put(col, new Vector<String>());
}

// Iterate over the dataset and update the map
my_dataset.foreach( (ForeachFunction<Row>) row -> {
    // For each column, append this row's value (as a String) to the column's Vector
    for(String col : my_map.keySet())
    {
        my_map.get(col).add(row.get(row.fieldIndex(col)).toString());
    }
});
Issue
My issue is that the foreach doesn't iterate at all: the lambda is never executed, and I don't know why.
I implemented it as indicated here: How to traverse/iterate a Dataset in Spark Java?
In the end, all the inner Vectors remain empty (as they were initialized) even though the Dataset is not (see the first comments in the code sample above).
I know that the foreach never iterates because I did two tests:
- I added an AtomicInteger to count the iterations and incremented it right at the beginning of the lambda with incrementAndGet(). => The counter value remains 0 at the end of the process.
- I printed a debug message right at the beginning of the lambda. => The message is never displayed.
A rough sketch of these two tests is shown below.
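For reference, here is roughly how I did the two tests (a minimal sketch; the variable names and the message text are illustrative, not my exact code):

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

// Counter used to check whether the lambda body runs at all
AtomicInteger row_counter = new AtomicInteger(0);

my_dataset.foreach( (ForeachFunction<Row>) row -> {
    // Test 1: count the iterations
    row_counter.incrementAndGet();
    // Test 2: print a debug message
    System.out.println("Inside the foreach lambda");
    // ... same map-update code as above ...
});

// Observed result: the counter is still 0 and the message is never printed
System.out.println("Iterations counted: " + row_counter.get());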
I'm not used to Java (even less to Java lambdas), so maybe I missed an important point, but I can't find what it is.