
I use Spark 2.0.1.

I am trying to find distinct values in a JavaRDD as below

JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();

I see that this line is throwing the below exception

Exception in thread "main" java.lang.StackOverflowError
    at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
   ..........

The same stack trace is repeated again and again. The input filteredInstalledApp_Ids is large, with millions of records. Could the issue be the number of records, or is there a more efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.

Edit 1:

Adding the filter method

JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String v1) throws Exception {
                return v1 != null;
            }
        }).cache();

Edit 2:

Added the method used to generate installedApp_Ids

 public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
        JavaRDD<String> installedApp_Ids) {

    JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
    try {
        JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
            @Override
            public String call(String t) throws Exception {
                String delimiter = "\t";
                String[] id_Type = t.split(delimiter);
                StringBuilder temp = new StringBuilder(id_Type[1]);
                if ((temp.indexOf("\"")) != -1) {
                    String escaped = temp.toString().replace("\\", "");
                    escaped = escaped.replace("\"{", "{");
                    escaped = escaped.replace("}\"", "}");
                    temp = new StringBuilder(escaped);
                }
                // To remove empty character in the beginning of a
                // string
                JSONObject wholeventObj = new JSONObject(temp.toString());
                JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
                int appType = eventJsonObj.getInt("appType");
                if (appType == 1) {
                    try {                           
                        return (String.valueOf(appType));
                    } catch (JSONException e) {
                        return null;
                    }
                }
                return null;
            }
        }).cache();
        if (installedApp_Ids != null)
            return sc.union(installedApp_Ids, appIdsRDD1);
        else
            return appIdsRDD1;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}

1 Answer


I assume the main dataset is in inputPath. Given the split on "\t" in your map function, it appears to be a tab-separated file with JSON-encoded values.

I think you could make your code a bit simpler by combining Spark SQL's DataFrames with the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)

The lines where you load the inputPath text file and the line parsing itself can be as simple as the following:

import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.option("sep", "\t").csv(inputPath)

You can display the content using the show operator.

dataset.show(truncate = false)

You should see the JSON-encoded lines.

It appears that each JSON value contains an eventData object with an appType field.

val jsons = dataset.withColumn("asJson", from_json(...))

See the functions object for reference.
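
For illustration, here is a minimal sketch of what that from_json call could look like. It assumes the JSON payload lands in the second CSV column (_c1 by Spark's default naming) and that eventData is a struct with an integer appType; both the column name and the schema are assumptions, not something stated in the post:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructType}

// Assumed schema: the payload is a JSON object with an eventData struct
// holding an integer appType field
val eventSchema = new StructType()
  .add("eventData", new StructType().add("appType", IntegerType))

// _c1 is Spark's default name for the second CSV column (assumed to hold the JSON)
val jsons = dataset.withColumn("asJson", from_json(col("_c1"), eventSchema))

Note that from_json is available as of Spark 2.1, so this particular step needs an upgrade from 2.0.1.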

With JSON lines, you can select the fields of your interest:

val apptypes = jsons.select("asJson.eventData.appType")

And then union it with installedApp_Ids.
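
A hedged sketch of that step, assuming installedApp_Ids is brought over to the Dataset side first (the conversion and column names below are assumptions):

import spark.implicits._

// Assumption: installedApp_Ids holds app-type strings, so it can be lifted into
// a single-column DataFrame and unioned by position
val installedIdsDF = installedApp_Ids.rdd.toDS().toDF("appType")

val combined = apptypes
  .select($"appType".cast("string").as("appType")) // align types before the union
  .union(installedIdsDF)
  .distinct()                                      // mirrors the distinct() from the original question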

I'm sure the code gets easier to read (and hopefully to write, too). The migration will also give you extra optimizations that you may or may not be able to write yourself using the assembler-like RDD API.

And the best part is that filtering out nulls is as simple as using the na operator, which gives you DataFrameNaFunctions like drop. I'm sure you'll like them.
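
For instance, just a sketch:

// Drop rows where any column is null (e.g. records that had no parsable appType)
val cleaned = jsons.na.drop()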


It does not necessarily answer your initial question, but this java.lang.StackOverflowError might go away just by doing the code migration, and the code gets easier to maintain, too.

Jacek Laskowski