
We need to compute a distance matrix (e.g. Jaccard similarity) over a huge collection of `Dataset` rows in Spark. We are facing a couple of issues. Kindly give us some direction.

Issue 1

    import info.debatty.java.stringsimilarity.Jaccard;

    // Sample Dataset creation
    List<Row> data = Arrays.asList(
            RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
            RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
            RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));

    StructType schema = new StructType(new StructField[] {
            new StructField("label", DataTypes.StringType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });
    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    // Similarity measure object creation
    Jaccard jaccard = new Jaccard();

    // Apply the similarity measure to each element of the Dataset.
    Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
            (MapFunction<Row, String>) row -> "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
            Encoders.STRING());
    sentenceDataFrame1.show();

There are no compile-time errors, but we get a runtime exception:

org.apache.spark.SparkException: Task not serializable

Issue 2
Moreover, we need to find which pair has the highest score, for which we need to declare some variables; we also need to perform other calculations, and we are facing a lot of difficulty.
Even declaring a simple counter variable inside the map block does not work: we cannot capture the incremented value. If we declare it outside the map block, we get lots of compile-time errors.

    int counter = 0;
    Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
            (MapFunction<Row, String>) row -> {
                System.out.println("Name: " + row.getString(1));
                //int counter = 0;
                counter++;
                System.out.println("Counter: " + counter);
                return counter + "";
            },
            Encoders.STRING());
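(For context on the compile errors: Java lambdas can only capture local variables that are final or effectively final, so `counter++` inside the lambda is rejected by the compiler. A minimal plain-Java sketch of that restriction and the usual mutable-holder workaround, which compiles and works in a single JVM but would still not help across Spark executors, since each executor increments its own deserialized copy:)

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

public class CounterCapture {
    public static void main(String[] args) {
        // int counter = 0;
        // Function<String, String> f = s -> String.valueOf(counter++);
        // does not compile: "local variables referenced from a lambda
        // expression must be final or effectively final"

        // A mutable holder compiles, and works in plain single-JVM Java...
        AtomicLong counter = new AtomicLong();
        Function<String, String> f = s -> String.valueOf(counter.incrementAndGet());
        List.of("a", "b", "c").forEach(f::apply);
        System.out.println(counter.get()); // 3 in a single JVM
        // ...but in Spark each executor would mutate its own serialized copy,
        // so a driver-side count like this never reflects executor-side work.
    }
}
```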

Please give us some direction. Thank you.

Nischay

1 Answer


    Jaccard jaccard = new Jaccard();

Is this class serializable?

In Spark, all the code you write inside transformations is instantiated on the driver, serialized, and sent to the executors.

Since you use lambda functions:

  1. Every class used from the outer class inside the lambda needs to be serializable.

  2. If you use even a single method of the outer class inside the lambda, the outer class itself must be serializable.
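A quick way to check whether an object will survive this is to round-trip it through Java serialization yourself, the same mechanism Spark uses to ship a closure and everything it captures from driver to executors. A minimal sketch using a stand-in class (whether info.debatty's `Jaccard` declares `Serializable` depends on the library version, so check its source):

```java
import java.io.*;

public class SerializationCheck {
    // Stand-in for a similarity class; replace with the real one to test it.
    static class MySimilarity implements Serializable {
        double similarity(String a, String b) { return a.equals(b) ? 1.0 : 0.0; }
    }

    // Round-trip the object through Java serialization, as Spark would.
    static boolean roundTrips(Object o) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(new MySimilarity())); // true
        System.out.println(roundTrips(new Object()));       // false
    }
}
```

If the similarity class is not serializable, a common workaround is to construct it inside the lambda (so nothing non-serializable is captured from the enclosing scope).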

For a detailed understanding, please refer to:

http://bytepadding.com/big-data/spark/spark-code-analysis/

http://bytepadding.com/big-data/spark/understanding-spark-serialization/

Part 2:

  1. Try computing the Cartesian product (N CROSS N) in Spark.
  2. Try to find a more clever algorithm for finding the best pair.
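As a plain-Java sketch of what that N x N search computes (token-set Jaccard over all distinct pairs, keeping the best-scoring one); in Spark the same shape would be a `crossJoin` of the dataset with itself followed by a reduce, not a driver-side loop or a mutable counter. The `jaccard` helper here is a hand-rolled illustration, not the library's implementation:

```java
import java.util.*;

public class BestPair {
    // Jaccard similarity over whitespace-separated token sets: |A ∩ B| / |A ∪ B|
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "Hi I heard about Spark",
                "Hi I Know about Spark",
                "I wish Java could use case classes");
        double best = -1;
        int bi = -1, bj = -1;
        // All distinct pairs i < j: the upper triangle of the N x N matrix.
        for (int i = 0; i < sentences.size(); i++) {
            for (int j = i + 1; j < sentences.size(); j++) {
                double s = jaccard(sentences.get(i), sentences.get(j));
                if (s > best) { best = s; bi = i; bj = j; }
            }
        }
        System.out.printf(Locale.ROOT, "best pair: (%d, %d) score %.2f%n", bi, bj, best);
    }
}
```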

More input on the question would help in providing a better answer.

KrazyGautam