I am learning Spark and would like to get feedback on the best approach to the problem below.

I have two datasets, users and transactions, as described below, and would like to join them to find the unique locations per item sold.

The headers for the two files are as follows:

id,email,language,location ----------- USER HEADERS
txid,productid,userid,price,desc -------------------- TRANSACTION HEADERS

Below is my approach

    /*
     * Load the user data set into userDataFrame
     * Load the transaction data set into transactionDataFrame
     * Join both on user id - userTransactionFrame
     * Select the productid and location columns from the joined dataset into a new dataframe - productIdLocationDataFrame
     * Convert the new dataframe into a JavaRDD - productIdLocationJavaRDD
     * Make the JavaRDD a pair RDD - productIdLocationJavaPairRDD
     * Group the pair RDD by key - productLocationList
     * Apply mapValues on the grouped keys to convert the list of values into a set of values for duplicate filtering - productUniqLocations
     */

I am not very sure that I have done this the right way, and I still feel it could be done better or differently.

I am particularly doubtful about the part where I filter duplicates out of the JavaPairRDD.

Please evaluate the approach and the code, and suggest better solutions.

Code

    SparkConf conf = new SparkConf();
    conf.setAppName("Sample App - Uniq Location per item");
    conf.setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(jsc);

    //id    email   language    location ----------- USER HEADERS
    DataFrame userDataFrame = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .option("delimiter", "\t")
            .load("user");

    //txid  pid uid price   desc -------------------- TRANSACTION HEADERS
    DataFrame transactionDataFrame = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .option("delimiter", "\t")
            .load("transactions");

    Column joinColumn = userDataFrame.col("id").equalTo(transactionDataFrame.col("uid"));

    DataFrame userTransactionFrame = userDataFrame.join(transactionDataFrame,joinColumn,"rightouter");

    DataFrame productIdLocationDataFrame = userTransactionFrame.select(userTransactionFrame.col("pid"),userTransactionFrame.col("location"));

    JavaRDD<Row> productIdLocationJavaRDD = productIdLocationDataFrame.toJavaRDD();

    JavaPairRDD<String, String> productIdLocationJavaPairRDD = productIdLocationJavaRDD.mapToPair(new PairFunction<Row, String, String>() {

        public Tuple2<String, String> call(Row inRow) throws Exception {
            // pid may have been inferred as a numeric type, so convert both columns to String explicitly
            return new Tuple2<String, String>(String.valueOf(inRow.get(0)), String.valueOf(inRow.get(1)));
        }
    });


    JavaPairRDD<String, Iterable<String>> productLocationList = productIdLocationJavaPairRDD.groupByKey();

    JavaPairRDD<String, Iterable<String>> productUniqLocations = productLocationList.mapValues(new Function<Iterable<String>, Iterable<String>>() {

        public Iterable<String> call(Iterable<String> inputValues) throws Exception {
            // copy into a HashSet to drop duplicate locations; the Iterable is not guaranteed to be a Collection, so avoid the cast
            HashSet<String> uniqueLocations = new HashSet<String>();
            for (String location : inputValues) {
                uniqueLocations.add(location);
            }
            return uniqueLocations;
        }
    });

    productUniqLocations.saveAsTextFile("uniq");

The good part is that the code runs and generates the output that I expect.
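
For the duplicate filtering specifically, one variation I have been wondering about (just a sketch, not something I have settled on) is dropping duplicate (pid, location) rows while still inside the DataFrame API, so that the grouped values are already unique and the HashSet conversion is no longer needed:

    // hypothetical variable name; distinct() removes duplicate (pid, location) rows before leaving the DataFrame API
    DataFrame distinctProductIdLocations = productIdLocationDataFrame.distinct();

The rest of the pipeline would stay the same apart from the mapValues step.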

1 Answer

The lowest hanging fruit is getting rid of groupByKey.

aggregateByKey should do the job here, since the output value type (a set of locations per key) differs from the input value type.

Code in Scala:

    pairRDD.aggregateByKey(new java.util.HashSet[String])(
      (locationSet, location) => { locationSet.add(location); locationSet },
      (locSet1, locSet2) => { locSet1.addAll(locSet2); locSet1 }
    )

Java Equivalent:

    Function2<HashSet<String>, String, HashSet<String>> sequenceFunction = new Function2<HashSet<String>, String, HashSet<String>>() {

        public HashSet<String> call(HashSet<String> aSet, String location) throws Exception {
            aSet.add(location);
            return aSet;
        }
    };

    Function2<HashSet<String>, HashSet<String>, HashSet<String>> combineFunc = new Function2<HashSet<String>, HashSet<String>, HashSet<String>>() {

        public HashSet<String> call(HashSet<String> set1, HashSet<String> set2) throws Exception {
            set1.addAll(set2);
            return set1;
        }
    };

    JavaPairRDD<String, HashSet<String>> byKey = productIdLocationJavaPairRDD.aggregateByKey(new HashSet<String>(), sequenceFunction, combineFunc);
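
The aggregated pair RDD already holds one set of unique locations per product, so it replaces the groupByKey/mapValues pair entirely and can be written out the same way as in your code:

    // same output path as in the question
    byKey.saveAsTextFile("uniq");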


Secondly, joins work best when the datasets are co-partitioned.

Since you are dealing with DataFrames, controlling the partitioning out of the box is not possible on Spark < 1.6. You may therefore want to read the data into RDDs, partition them, and only then create the DataFrames. For your use case it might be better not to involve DataFrames at all; see the sketch below.
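
A minimal sketch of that RDD-only route, assuming both files are tab-delimited as in your reader options; the variable names, the partition count, and the omitted header/error handling are mine, not from your code:

    // needs org.apache.spark.HashPartitioner in addition to the classes already imported
    HashPartitioner partitioner = new HashPartitioner(8); // partition count chosen arbitrarily for the sketch

    // user file: id, email, language, location -> key by id, keep location
    JavaPairRDD<String, String> locationByUserId = jsc.textFile("user")
            .mapToPair(new PairFunction<String, String, String>() {
                public Tuple2<String, String> call(String line) throws Exception {
                    String[] fields = line.split("\t");
                    return new Tuple2<String, String>(fields[0], fields[3]);
                }
            })
            .partitionBy(partitioner);

    // transaction file: txid, pid, uid, price, desc -> key by uid, keep pid
    JavaPairRDD<String, String> productIdByUserId = jsc.textFile("transactions")
            .mapToPair(new PairFunction<String, String, String>() {
                public Tuple2<String, String> call(String line) throws Exception {
                    String[] fields = line.split("\t");
                    return new Tuple2<String, String>(fields[2], fields[1]);
                }
            })
            .partitionBy(partitioner);

    // both sides now share the same partitioner, so the join does not trigger another shuffle;
    // the header row and malformed lines would still need to be filtered out in real code
    JavaPairRDD<String, Tuple2<String, String>> joined = locationByUserId.join(productIdByUserId);

From joined you can map each record to a (pid, location) pair and feed it straight into the aggregateByKey shown above.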
