
According to the official docs (https://spark.apache.org/docs/latest/api/sql/index.html#array_size), array_size is present from Spark 3.3.0, but I need the same in Spark 3.2.0.

Is there some alternative to array_size that I can use while writing a SQL query for data residing in an Apache Iceberg table? (The SQL query is then run through Apache Spark 3.2.2.)

Alok Singh

1 Answer


Here is some sample code to do that in a DataFrame, using the size function:

from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.appName("AlternativeArray_Size").getOrCreate()

data = [(1, [10, 20, 30]), (2, [40, 50]), (3, [60])]
columns = ["id", "values"]
df = spark.createDataFrame(data, columns)

# size() returns the number of elements in each row's array
result = df.withColumn("array_size", size(df["values"]))
result.show()  # array_size: 3, 2, 1

In this example, the size function returns the number of elements in each row's array directly; there is no need to explode the array into separate rows and count them.

The same works in plain SQL: calling size directly on an array expression such as size(array('b', 'd', 'c', 'a')) returns 4.
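Applied to the question's setup, here is a minimal sketch of that SQL run through PySpark (the table name db.my_iceberg_table and the column names are hypothetical placeholders for your Iceberg table):

result = spark.sql("""
    SELECT id,
           values,
           size(values) AS array_size  -- element count; available on Spark 3.2.x
    FROM db.my_iceberg_table           -- hypothetical Iceberg table
""")
result.show()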

size(map('a', 1, 'b', 2)) returns 2, so size also works on maps. With the default settings, the function returns -1 instead of NULL for NULL input; it returns NULL for NULL input only if spark.sql.legacy.sizeOfNull is set to false or spark.sql.ansi.enabled is set to true.
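To see that NULL behaviour concretely, a quick sketch (the CAST is needed so the NULL literal has an array type):

# default settings: legacy behaviour, so size of a NULL array is -1
spark.sql("SET spark.sql.legacy.sizeOfNull=true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  # -1

# with the legacy flag off, size returns NULL for NULL input
spark.sql("SET spark.sql.legacy.sizeOfNull=false")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  # NULL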

In fact, there is not much difference between array_size and size: they return different values for NULL input (array_size returns NULL, while size returns -1 under the default settings), and array_size only accepts arrays, whereas size can also be used on maps, as in the example above.
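For completeness, a sketch that runs only on Spark 3.3+ (where array_size exists) and shows where the two diverge:

# both return 3 for a non-NULL array
spark.sql("SELECT array_size(array(1, 2, 3)), size(array(1, 2, 3))").show()

# with default settings they differ on NULL input:
# array_size returns NULL, while size returns -1
spark.sql("""
    SELECT array_size(CAST(NULL AS ARRAY<INT>)),
           size(CAST(NULL AS ARRAY<INT>))
""").show()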

Source: https://spark.apache.org/docs/latest/api/sql/index.html#size

  • Could you please help in finding the right alternative to the array_size function for a SQL query against the Iceberg table? (The SQL query is eventually fed to Apache Spark 3.2.2 to run.) – Alok Singh Aug 20 '23 at 06:06
  • @AlokSingh, I doubt it's different for an Iceberg table as the source; these functions should be valid for all sources. Can you try this and let me know the output? – Ziya Mert Karakas Aug 20 '23 at 09:20
  • I can't try this, as all I can write is a SQL query, not Spark code – Alok Singh Aug 20 '23 at 11:56
  • ```spark.sql(""" SELECT id, values, size(values) AS arraySize FROM data """)``` Simply use the size function on your array column inside the SQL statement; no explode is needed... https://spark.apache.org/docs/latest/api/sql/index.html#size – Ziya Mert Karakas Aug 20 '23 at 12:08