
Context

I want to iterate over a Spark Dataset and update a HashMap for each row.

Here is the code I have:

// At this point, I have a my_dataset variable containing 300 000 rows and 10 columns
// - my_dataset.count() == 300 000
// - my_dataset.columns().length == 10

// Declare my HashMap
HashMap<String, Vector<String>> my_map = new HashMap<String, Vector<String>>();

// Initialize the map
for(String col : my_dataset.columns())
{
    my_map.put(col, new Vector<String>());
}

// Iterate over the dataset and update the map
my_dataset.foreach( (ForeachFunction<Row>) row -> {
    for(String col : my_map.keySet())
    {
        my_map.get(col).add(row.get(row.fieldIndex(col)).toString());
    }
});

Issue

My issue is that the foreach doesn't iterate at all: the lambda is never executed, and I don't know why.
I implemented it as indicated here: How to traverse/iterate a Dataset in Spark Java?

At the end, all the inner Vectors remain empty (as they were initialized) even though the Dataset is not (take a look at the first comments in the given code sample).

I know that the foreach never iterates because I did two tests:

  • Add an AtomicInteger to count the iterations and increment it right at the beginning of the lambda with the incrementAndGet() method. => The counter value remains 0 at the end of the process.
  • Print a debug message right at the beginning of the lambda. => The message is never displayed.

I'm not used to Java (even less to Java lambdas), so maybe I missed an important point, but I can't find what.

Fareanor

1 Answer


I am probably a little old school, but I have never liked lambdas too much, as they can get pretty complicated.

Here is a full example of a foreach():

package net.jgp.labs.spark.l240_foreach.l000;

import java.io.Serializable;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForEachBookApp implements Serializable {
  private static final long serialVersionUID = -4250231621481140775L;

  private final class BookPrinter implements ForeachFunction<Row> {
    private static final long serialVersionUID = -3680381094052442862L;

    @Override
    public void call(Row r) throws Exception {
      System.out.println(r.getString(2) + " can be bought at " + r.getString(
          4));
    }
  }

  public static void main(String[] args) {
    ForEachBookApp app = new ForEachBookApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder().appName("For Each Book").master(
        "local").getOrCreate();

    String filename = "data/books.csv";
    Dataset<Row> df = spark.read().format("csv").option("inferSchema", "true")
        .option("header", "true")
        .load(filename);
    df.show();

    df.foreach(new BookPrinter());
  }
}

As you can see, this example reads a CSV file and prints a message from the data. It is fairly simple.

The foreach() receives a new instance of the class where the work is done.

df.foreach(new BookPrinter());

The work is done in the call() method of the class:

  private final class BookPrinter implements ForeachFunction<Row> {

    @Override
    public void call(Row r) throws Exception {
...
    }
  }

As you are new to Java, make sure you have the right signature (for classes and methods) and the right imports.
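
If you would rather keep the lambda form from the question, the same example can be written in one call. This is only a sketch assuming the df from the code above; the cast to ForeachFunction<Row> is what lets the Java compiler pick the right foreach() overload (there is also a Scala-oriented one), and whatever the lambda captures must be serializable because it is shipped to the executors:

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

// The cast selects the Java-friendly foreach(ForeachFunction<Row>) overload.
df.foreach((ForeachFunction<Row>) r ->
    System.out.println(r.getString(2) + " can be bought at " + r.getString(4)));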

You can also clone the example from https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l240_foreach/l000. This should help you with foreach().

jgp
  • I see, but I don't want to print rows, I want to update a HashMap that would not be known inside the `ForeachFunction`'s derived object. I'll try it with making my variable global (or better, member of an outer class so that it will be known), I'll tell you if it solves my issue. – Fareanor Nov 10 '21 at 15:54
  • I was more suggesting this code to see if you could iterate through your dataframe, to see if anything would happen and build from there, then focus on the map... How familiar are you with Spark's architecture and where things are running? The main code runs on the driver node and the ForeachFunction will be serialized to the executor node, so you will not be able to share a memory object between the two... – jgp Nov 10 '21 at 16:04
  • I get a NotSerializableException. I have no knowledge of Spark really, I'm just required to use it. But I guess your last comment may explain why it never iterates if it's not possible to execute my code. But how? How is it impossible to just iterate over a container in Spark? Like a basic for loop. – Fareanor Nov 10 '21 at 16:24
  • Maybe you should start another question and I'd be happy to answer it as well... but in a nutshell, with Spark, you have this logical container holding the data and offering a pretty robust API, the dataframe. The way you deal with the data is that you knead the dataframe to death, until you get the shape of the data you want, then you dump it where you need it. As you transform the data with the Spark API, you leverage the distributed processing... Does that make sense? – jgp Nov 10 '21 at 19:51
  • Oh I see, finally since I only have 300 000 rows by 10 columns, I used the function `collectAsList()` to do my iterations, even if I'm not fond of copying everything like that in memory just to iterate over it... Thanks for your help! Your answer didn't help me but your comments did, so I give you the checkmark :) – Fareanor Nov 12 '21 at 07:53
  • Thanks - I don’t know your final use case (as in why do you want to populate a list?) but it’s a common temptation I see with Spark developers: do the minimum with Spark, then bring it “home” in an environment I know better, whereas most of this work could maybe be done directly by Spark… that’s where a `foreach()` can become extremely popular. – jgp Nov 12 '21 at 11:29
  • This is exactly the point of my question, since I can't do the `foreach()` given by Spark, I have no other choice than collect() the data out of the spark dataset. I populate a HashMap because I need to concatenate all rows in one single string (hence why I populate a HashMap because I want to do this for each column, as my code sample shows). It works currently. But if Spark can do such a thing, that I didn't know. – Fareanor Nov 12 '21 at 12:36
  • Spark can concatenate strings and do many operations directly on the dataframe. You can do fairly complex transactions. You can see some here: https://github.com/jgperrin/net.jgp.books.spark.ch12 (records) and https://github.com/jgperrin/net.jgp.books.spark.ch13 (documents). – jgp Nov 12 '21 at 13:14
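
For completeness, here is roughly what the collectAsList() workaround mentioned in the comments looks like. This is only a sketch built on the variables from the question (my_dataset, my_map): it brings every row back to the driver, so a plain local loop can fill the map, but it only stays practical as long as the whole Dataset fits in the driver's memory.

import java.util.List;
import org.apache.spark.sql.Row;

// Materialize the whole Dataset on the driver, then iterate locally.
List<Row> rows = my_dataset.collectAsList();
for (Row row : rows) {
    for (String col : my_map.keySet()) {
        // String.valueOf() avoids a NullPointerException on null cells.
        my_map.get(col).add(String.valueOf(row.get(row.fieldIndex(col))));
    }
}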
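
And as the last comment points out, the concatenation itself can be done by Spark without collecting the rows at all. A hedged sketch, assuming the goal stated in the comments (one concatenated string per column) and using Spark's built-in collect_list and concat_ws functions; the comma separator and the cast to string are illustrative choices, not requirements:

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// One aggregation per column: gather the values, then join them into a single string.
// Note: collect_list() does not guarantee a deterministic row order.
Column[] aggs = Arrays.stream(my_dataset.columns())
    .map(c -> concat_ws(",", collect_list(col(c).cast("string"))).alias(c))
    .toArray(Column[]::new);

// A single Row holding one concatenated string per column.
Row concatenated = my_dataset
    .agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length))
    .first();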