
I have a column of type set, and I use collect_set() from the Spark Dataset API, which returns a wrapped array of wrapped arrays. I want a single array containing all the values of the nested wrapped arrays. How can I do that?

E.g., Cassandra table:

Col1  
{1,2,3}
{1,5}

I'm using the Spark Dataset API.
row.get(0) returns a wrapped array of wrapped arrays.
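
For context, here is a minimal sketch of the aggregation being described (table is a hypothetical Dataset<Row> read from the Cassandra table; column names follow the example above):

import static org.apache.spark.sql.functions.collect_set;

// collect_set over a column that is itself a collection yields
// array<array<int>>, which the Java API hands back as a
// WrappedArray of WrappedArrays (element order is not guaranteed)
Dataset<Row> ds = table.agg(collect_set(table.col("Col1")).as("value"));
Row row = ds.first();
row.get(0); // e.g. WrappedArray(WrappedArray(1, 2, 3), WrappedArray(1, 5))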

2 Answers


Suppose you have a Dataset<Row> ds which has a value column:

+-----------------------+
|value                  |
+-----------------------+
|[WrappedArray(1, 2, 3)]|
+-----------------------+

and it has the following schema:

root
 |-- value: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: integer (containsNull = false)

Using a UDF

Define a UDF1 like below.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.api.java.UDF1;
import scala.collection.JavaConversions;
import scala.collection.mutable.WrappedArray;

// Flattens the nested WrappedArray into a single Java List
static UDF1<WrappedArray<WrappedArray<Integer>>, List<Integer>> getValue = new UDF1<WrappedArray<WrappedArray<Integer>>, List<Integer>>() {
    public List<Integer> call(WrappedArray<WrappedArray<Integer>> data) throws Exception {
        List<Integer> intList = new ArrayList<Integer>();
        for (int i = 0; i < data.size(); i++) {
            intList.addAll(JavaConversions.seqAsJavaList(data.apply(i)));
        }
        return intList;
    }
};

Register and call the UDF like below.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;
import org.apache.spark.sql.types.DataTypes;

// Register the UDF with the return type array<int>
spark.udf().register("getValue", getValue, DataTypes.createArrayType(DataTypes.IntegerType));

// Call the UDF
Dataset<Row> ds1 = ds.select(col("*"), callUDF("getValue", col("value")).as("udf-value"));
ds1.show(false);
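
For the sample row above, ds1.show(false) should print something like this (sketched from the UDF's logic, not captured from a run):

+-----------------------+---------+
|value                  |udf-value|
+-----------------------+---------+
|[WrappedArray(1, 2, 3)]|[1, 2, 3]|
+-----------------------+---------+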

Using the explode function

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

Dataset<Row> ds2 = ds.select(explode(col("value")).as("explode-value"));
ds2.show(false);
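
Note that a single explode only unnests one level, producing one row per inner array. If you want one row per element instead, you can explode twice (a sketch; the aliases inner and element are illustrative):

Dataset<Row> ds3 = ds
    .select(explode(col("value")).as("inner"))
    .select(explode(col("inner")).as("element"));
ds3.show(false);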

If you have a DataFrame, you can use a UDF to flatten the list. Below is a simple example.

import spark.implicits._
import org.apache.spark.sql.functions._

// Create some dummy data
val df = Seq(
  (1, List(1, 2, 3)),
  (1, List(5, 7, 9)),
  (2, List(4, 5, 6)),
  (2, List(7, 8, 9))
).toDF("id", "list")

// collect_set gathers each id's lists into an array of arrays
val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))

df1.show(false)

Output for df1 (note that collect_set does not preserve insertion order, as the rows for id 2 show):

+---+----------------------------------------------+
|id |col1                                          |
+---+----------------------------------------------+
|1  |[WrappedArray(1, 2, 3), WrappedArray(5, 7, 9)]|
|2  |[WrappedArray(7, 8, 9), WrappedArray(4, 5, 6)]|
+---+----------------------------------------------+


// UDF that flattens the nested sequence into a single sequence
val testUDF = udf((list: Seq[Seq[Integer]]) => list.flatten)

df1.withColumn("newCol", testUDF($"col1")).show(false)

Output:

+---+----------------------------------------------+------------------+
|id |col1                                          |newCol            |
+---+----------------------------------------------+------------------+
|1  |[WrappedArray(1, 2, 3), WrappedArray(5, 7, 9)]|[1, 2, 3, 5, 7, 9]|
|2  |[WrappedArray(7, 8, 9), WrappedArray(4, 5, 6)]|[7, 8, 9, 4, 5, 6]|
+---+----------------------------------------------+------------------+

I hope this helps!

  • Can you please post the Java equivalent code for the UDF? I saw this flatten function on Seq but could not use it properly. – rohanagarwal Jul 26 '17 at 12:56
  • I hope this helps: https://stackoverflow.com/questions/35348058/how-do-i-call-a-udf-on-a-spark-dataframe-using-java – koiralo Jul 26 '17 at 13:25
  • Actually I want the implementation of flatten; it is not as simple as list.flatten in Java, maybe because Scala is richer. The docs for flatten are a single line, which doesn't make sense to me :( – rohanagarwal Jul 26 '17 at 13:49
  • You can write a UDF, loop through the array, and then create a new array that is flattened. – koiralo Jul 26 '17 at 14:35
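
On the Java flatten question in these comments: if you are on Spark 2.4 or later, the built-in flatten function does this without a hand-written loop (a minimal sketch in Java, assuming a Dataset<Row> df1 with the nested col1 column as in this answer; on earlier versions use a UDF like the one in the accepted answer):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.flatten;

// Spark 2.4+ only: flatten converts array<array<int>> to array<int>
Dataset<Row> flat = df1.withColumn("newCol", flatten(col("col1")));
flat.show(false);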