
I have a column of type set, and I use collect_set() from the Spark Dataset API, which returns a wrapped array of wrapped arrays. I want a single array containing all the values of the nested wrapped arrays. How can I do that?

E.g., Cassandra table:

Col1  
{1,2,3}
{1,5}

I'm using the Spark Dataset API.
row.get(0) returns a wrapped array of wrapped arrays.
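
For context, here is a minimal sketch of the aggregation being described (table is a hypothetical Dataset<Row> read from the Cassandra table; column names follow the example above):

import static org.apache.spark.sql.functions.collect_set;

// collect_set over a column that is itself a collection yields
// array<array<int>>, which the Java API hands back as a
// WrappedArray of WrappedArrays (element order is not guaranteed)
Dataset<Row> ds = table.agg(collect_set(table.col("Col1")).as("value"));
Row row = ds.first();
row.get(0); // e.g. WrappedArray(WrappedArray(1, 2, 3), WrappedArray(1, 5))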

2 Answers


Suppose you have a Dataset<Row> ds which has a value column:

+-----------------------+
|value                  |
+-----------------------+
|[WrappedArray(1, 2, 3)]|
+-----------------------+

and it has the following schema:

root
 |-- value: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: integer (containsNull = false)

Using a UDF

Define a UDF1 like below.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.api.java.UDF1;
import scala.collection.JavaConversions;
import scala.collection.mutable.WrappedArray;

// Flattens the nested WrappedArray into a single Java List
static UDF1<WrappedArray<WrappedArray<Integer>>, List<Integer>> getValue = new UDF1<WrappedArray<WrappedArray<Integer>>, List<Integer>>() {
    public List<Integer> call(WrappedArray<WrappedArray<Integer>> data) throws Exception {
        List<Integer> intList = new ArrayList<Integer>();
        for (int i = 0; i < data.size(); i++) {
            intList.addAll(JavaConversions.seqAsJavaList(data.apply(i)));
        }
        return intList;
    }
};

Register and call the UDF like below.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;
import org.apache.spark.sql.types.DataTypes;

// Register the UDF with the return type array<int>
spark.udf().register("getValue", getValue, DataTypes.createArrayType(DataTypes.IntegerType));

// Call the UDF
Dataset<Row> ds1 = ds.select(col("*"), callUDF("getValue", col("value")).as("udf-value"));
ds1.show(false);
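
For the sample row above, ds1.show(false) should print something like this (sketched from the UDF's logic, not captured from a run):

+-----------------------+---------+
|value                  |udf-value|
+-----------------------+---------+
|[WrappedArray(1, 2, 3)]|[1, 2, 3]|
+-----------------------+---------+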

Using the explode function

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

Dataset<Row> ds2 = ds.select(explode(col("value")).as("explode-value"));
ds2.show(false);
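
Note that a single explode only unnests one level, producing one row per inner array. If you want one row per element instead, you can explode twice (a sketch; the aliases inner and element are illustrative):

Dataset<Row> ds3 = ds
    .select(explode(col("value")).as("inner"))
    .select(explode(col("inner")).as("element"));
ds3.show(false);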

If you have a DataFrame, you can use a UDF to flatten the list. Below is a simple example.

import spark.implicits._
import org.apache.spark.sql.functions._

// Create some dummy data
val df = Seq(
  (1, List(1, 2, 3)),
  (1, List(5, 7, 9)),
  (2, List(4, 5, 6)),
  (2, List(7, 8, 9))
).toDF("id", "list")

// collect_set gathers each id's lists into an array of arrays
val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))

df1.show(false)

Output for df1 (note that collect_set does not preserve insertion order, as the rows for id 2 show):

+---+----------------------------------------------+
|id |col1                                          |
+---+----------------------------------------------+
|1  |[WrappedArray(1, 2, 3), WrappedArray(5, 7, 9)]|
|2  |[WrappedArray(7, 8, 9), WrappedArray(4, 5, 6)]|
+---+----------------------------------------------+


// UDF that flattens the nested sequence into a single sequence
val testUDF = udf((list: Seq[Seq[Integer]]) => list.flatten)

df1.withColumn("newCol", testUDF($"col1")).show(false)

Output:

+---+----------------------------------------------+------------------+
|id |col1                                          |newCol            |
+---+----------------------------------------------+------------------+
|1  |[WrappedArray(1, 2, 3), WrappedArray(5, 7, 9)]|[1, 2, 3, 5, 7, 9]|
|2  |[WrappedArray(7, 8, 9), WrappedArray(4, 5, 6)]|[7, 8, 9, 4, 5, 6]|
+---+----------------------------------------------+------------------+

I hope this helps!

  • Can you please post the Java equivalent code for the UDF? I saw this flatten function on Seq but could not use it properly. – rohanagarwal Jul 26 '17 at 12:56
  • I hope this helps: https://stackoverflow.com/questions/35348058/how-do-i-call-a-udf-on-a-spark-dataframe-using-java – koiralo Jul 26 '17 at 13:25
  • Actually I want the implementation of flatten; it is not as simple as list.flatten in Java, maybe because Scala is richer. The docs for flatten are a single line, which doesn't make sense to me :( – rohanagarwal Jul 26 '17 at 13:49
  • You can write a UDF, loop through the array, and then create a new array that is flattened. – koiralo Jul 26 '17 at 14:35
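
On the Java flatten question in these comments: if you are on Spark 2.4 or later, the built-in flatten function does this without a hand-written loop (a minimal sketch in Java, assuming a Dataset<Row> df1 with the nested col1 column as in this answer; on earlier versions use a UDF like the one in the accepted answer):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.flatten;

// Spark 2.4+ only: flatten converts array<array<int>> to array<int>
Dataset<Row> flat = df1.withColumn("newCol", flatten(col("col1")));
flat.show(false);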