7

How to write the equivalent function of arrays_zip in Spark 2.3?

Source code from Spark 2.4

def arrays_zip(*cols):
    """
    Collection function: Returns a merged array of structs in which the N-th struct contains all
    N-th values of input arrays.

    :param cols: columns of arrays to be merged.

    >>> from pyspark.sql.functions import arrays_zip
    >>> df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
    >>> df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
    [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
    """
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.arrays_zip(_to_seq(sc, cols, _to_java_column)))

How can I achieve the same in PySpark on Spark 2.3?

Shaido
bp2010
  • You can probably test: `f = lambda x, y: list(zip(x, y))`; `myudf = F.udf(f, ArrayType(StructType([StructField('vals1', IntegerType(), False), StructField('vals2', IntegerType(), False)])))` followed by `df.select(myudf(F.col('vals1'), F.col('vals2'))).collect()`. Not sure, hence not posting as an answer; remove the `F` prefix if you have not named the imports as `F`. – anky Apr 29 '20 at 15:03

3 Answers

1

You can use a UDF to obtain the same functionality as `arrays_zip`. Note that the column element types need to be the same for this to work (`IntegerType` in this case). If the column types differ, cast the columns to a common type before using the UDF (a sketch of such a cast follows the usage example below).

from pyspark.sql import functions as F
from pyspark.sql import types as T

def zip_func(*args):
    # zip the input arrays element-wise; each resulting tuple
    # becomes an inner array in the returned array of arrays
    return list(zip(*args))

zip_udf = F.udf(zip_func, T.ArrayType(T.ArrayType(T.IntegerType())))

It can be used in the same way as `arrays_zip`, for example:

df = spark.createDataFrame([([1, 2, 3], [2, 3, 4])], ['vals1', 'vals2'])
df.select(zip_udf(df.vals1, df.vals2).alias('zipped')).collect()
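
If the input columns differ in element type, casting to a common type first keeps them consistent with the UDF's declared return type. A minimal sketch, assuming `vals2` arrived as an array of strings (the names and the cast target are illustrative):

# hypothetical: vals2 holds strings, so cast it to an integer array
# to match the UDF's declared ArrayType(ArrayType(IntegerType()))
df2 = df.withColumn('vals2', F.col('vals2').cast(T.ArrayType(T.IntegerType())))
df2.select(zip_udf(df2.vals1, df2.vals2).alias('zipped')).collect()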
Shaido
  • Did this run for you? I see a strange error: `net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for builtins.iter)` – bp2010 May 19 '20 at 12:58
  • @bp2010: I can't try out the code at the moment (would need to wait until tonight in my timezone), but the error is related to the return type not matching the UDF declaration. I changed the code in the answer, try it and see if it works for you. (If not, using a UDF with `return list([list(z) for z in zip(*args)])` would most definitely work, but I don't think it's necessary to do it that way.) – Shaido May 20 '20 at 01:35
  • Now that runs. But I am trying to use this function to explode the zip, and with this function I see the error: `org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(cols).` – bp2010 May 20 '20 at 06:13
  • @bp2010: Are you sure you are using `explode`? This looks like an error from star expansion (`.*`), which works on structs, while in this case the zip returns an array of arrays. This could be fixed by returning an array of structs (see the comment by anky on the question), but it will not be dynamic in the number of columns. – Shaido May 20 '20 at 06:49
  • Yes, I am using `explode`. The logic I posted here: https://stackoverflow.com/a/61087359/3213111 I was using arrays_zip because it is dynamic in the number of columns, which I need. Any idea how to do this in a dynamic manner for the columns? – bp2010 May 20 '20 at 07:05
  • I would say that in general this answer does not meet the requirement of `arrays_zip`; the return type should be "a merged array of structs...". – bp2010 May 22 '20 at 09:10
1

You can achieve this by creating a user-defined function (UDF):

import pyspark.sql.functions as f
import pyspark.sql.types as t

arrays_zip_ = f.udf(
    lambda x, y: list(zip(x, y)),
    t.ArrayType(t.StructType([
        # choose the data types according to your requirements
        t.StructField("first", t.IntegerType()),
        t.StructField("second", t.StringType())
    ])))

df = spark.createDataFrame([([1, 2, 3], ['2', '3', '4'])], ['first', 'second'])

The result with Spark <= 2.3:

df.select(arrays_zip_('first', 'second').alias('zipped')).show(2,False)

+------------------------+
|zipped                  |
+------------------------+
|[[1, 2], [2, 3], [3, 4]]|
+------------------------+

And the result with Spark 2.4's built-in `arrays_zip`:

df.select(f.arrays_zip('first', 'second').alias('zipped')).show(2,False)

+------------------------+
|zipped                  |
+------------------------+
|[[1, 2], [2, 3], [3, 4]]|
+------------------------+
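
Because this UDF returns an array of structs, the same shape the built-in `arrays_zip` produces, the result can be exploded and then star-expanded into columns, which is exactly what fails with the array-of-arrays return type discussed under the previous answer. A minimal sketch using the DataFrame above:

# one row per zipped pair, then expand the struct into columns
zipped = df.select(arrays_zip_('first', 'second').alias('zipped'))
zipped.select(f.explode('zipped').alias('pair')).select('pair.*').show()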
Shubham Jain
  • 1
    The above will only work for 2 arrays while `arrays_zip` works for any number of arrays. – Shaido May 07 '20 at 05:23
  • This gives you the flexibility to choose the data types being merged, and the code can be made dynamic if needed. – Shubham Jain May 07 '20 at 05:27
  • 1
    Can this be made dynamic, with a dynamic set of columns, not fixed as above? – bp2010 May 19 '20 at 13:04
  • Using the current function with an array gives the error: `TypeError: <lambda>() missing 1 required positional argument: 'y'` – bp2010 May 20 '20 at 10:46
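
On the dynamic-columns question from the comments: one possible sketch is to build the struct schema from the DataFrame's own schema rather than hard-coding it. `make_arrays_zip` is a hypothetical helper, not part of any API, and it assumes all listed columns are array columns (it reuses the `f`/`t` aliases imported above):

def make_arrays_zip(df, colnames):
    # derive one struct field per column, reusing each array
    # column's element type from the DataFrame schema
    fields = [t.StructField(name, df.schema[name].dataType.elementType)
              for name in colnames]
    return f.udf(lambda *arrays: list(zip(*arrays)),
                 t.ArrayType(t.StructType(fields)))

arrays_zip_dyn = make_arrays_zip(df, ['first', 'second'])
df.select(arrays_zip_dyn('first', 'second').alias('zipped')).show(truncate=False)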
0

You can simply use `f.array`, but then you have to access the values by index rather than by column name (that's the only difference).

from pyspark.sql import functions as f

df = df.withColumn('combined', f.array(f.col('col1'), f.col('col2'), f.col('col3')))
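
The values can then be read back by position. A minimal sketch, reusing the `combined` column from above:

# element 0 corresponds to col1, element 1 to col2, and so on
df.select(f.col('combined')[0].alias('col1_value')).show()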
Sajad Norouzi