
I want to adapt my Java Spark app, which currently uses RDDs for some calculations, to use Datasets instead. I'm new to Datasets and not sure which RDD transformation maps to which Dataset operation.

At the moment I map them like this:

JavaSparkContext.textFile(...)                       -> SQLContext.read().textFile(...)
JavaRDD.filter(Function)                             -> Dataset.filter(FilterFunction)
JavaRDD.map(Function)                                -> Dataset.map(MapFunction)
JavaRDD.mapToPair(PairFunction)                      -> Dataset.groupByKey(MapFunction) ???
JavaPairRDD.aggregateByKey(U, Function2, Function2)  -> KeyValueGroupedDataset.???

The corresponding questions are:

  • Is JavaRDD.mapToPair equivalent to the Dataset.groupByKey method?
  • Does JavaPairRDD map to KeyValueGroupedDataset?
  • Which method is equivalent to JavaPairRDD.aggregateByKey?

Specifically, I want to port the following RDD code to Datasets:

JavaRDD<Article> goodRdd = ...

JavaPairRDD<String, Article> ArticlePairRdd = goodRdd.mapToPair(new PairFunction<Article, String, Article>() {              // Build PairRDD<<Date|Store|Transaction><Article>>
    public Tuple2<String, Article> call(Article article) throws Exception {
        String key = article.getKeyDate() + "|" + article.getKeyStore() + "|" + article.getKeyTransaction() + "|" + article.getCounter();
        return new Tuple2<String, Article>(key, article);
    }
});

JavaPairRDD<String, String> transactionRdd = ArticlePairRdd.aggregateByKey("",                                              // Aggregate distributed data -> PairRDD<String, String>
    new Function2<String, Article, String>() {
        public String call(String oldString, Article newArticle) throws Exception {
            String articleString = newArticle.getOwg() + "_" + newArticle.getTextOwg();                                     // <<Date|Store|Transaction><owg_textOwg###owg_textOwg>>
            return oldString + "###" + articleString;
        }
    }, 
    new Function2<String, String, String>() {
        public String call(String a, String b) throws Exception {
            String c = a.concat(b);
            ...
            return c;
        }
    }
);
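To make concrete the behaviour I need to reproduce, here is a minimal plain-Java sketch (no Spark involved) of how I understand aggregateByKey: the first function folds each value into a per-key accumulator within a partition, starting from the zero value, and the second function merges the per-partition accumulators. The class AggregateByKeySketch, the Pair helper, the demo method, and the sample data are all hypothetical stand-ins. The article strings stand in for getOwg() + "_" + getTextOwg(), and the elided step in my combiner is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class AggregateByKeySketch {

    /** Hypothetical stand-in for a (key, value) pair; not a Spark class. */
    static final class Pair<K, V> {
        final K key;
        final V value;

        Pair(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Simulates aggregateByKey over a list of in-memory "partitions". */
    static <K, V, U> Map<K, U> aggregateByKey(
            List<List<Pair<K, V>>> partitions,
            U zero,
            BiFunction<U, V, U> seqFunc,
            BinaryOperator<U> combFunc) {
        Map<K, U> result = new LinkedHashMap<>();
        for (List<Pair<K, V>> partition : partitions) {
            // Step 1: fold the values of each key within this partition.
            Map<K, U> partial = new LinkedHashMap<>();
            for (Pair<K, V> p : partition) {
                U acc = partial.containsKey(p.key) ? partial.get(p.key) : zero;
                partial.put(p.key, seqFunc.apply(acc, p.value));
            }
            // Step 2: merge this partition's accumulators into the result.
            for (Map.Entry<K, U> e : partial.entrySet()) {
                result.merge(e.getKey(), e.getValue(), combFunc);
            }
        }
        return result;
    }

    /** Runs the sketch on made-up data spread over two partitions. */
    static Map<String, String> demo() {
        List<List<Pair<String, String>>> partitions = Arrays.asList(
            Arrays.asList(
                new Pair<String, String>("k1", "1_a"),
                new Pair<String, String>("k2", "2_b")),
            Arrays.asList(
                new Pair<String, String>("k1", "3_c")));
        return aggregateByKey(
            partitions,
            "",                                   // zero value, as in my RDD code
            (oldString, article) -> oldString + "###" + article,
            String::concat);                      // combiner without the elided step
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {k1=###1_a###3_c, k2=###2_b}
    }
}
```

The point is that the accumulator type (String) differs from the value type (Article in my real code), so whatever replaces this on the Dataset side has to support a zero value, a fold step, and a merge step.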

My code so far looks like this:

Dataset<Article> goodDS = ...

KeyValueGroupedDataset<String, Article> ArticlePairDS = goodDS.groupByKey(new MapFunction<Article, String>() {
    public String call(Article article) throws Exception {
        String key = article.getKeyDate() + "|" + article.getKeyStore() + "|" + article.getKeyTransaction() + "|" + article.getCounter();
        return key;
    }
}, Encoders.STRING());

// Here I need something similar to aggregateByKey. reduceByKey is not enough, because the result type (String) differs from the input type (Article).
D. Müller
