
Let's say I have the following dataframe:

    from pyspark.sql.types import ArrayType, IntegerType

    my_x = [([1,100]), ([2]), ([3,2])]
    my_df = spark.createDataFrame(my_x, ArrayType(IntegerType()))

Now, I want to extract the first element (int) from each array-row. So the final dataframe would have 1,2,3 (one per row). Is there a way of doing this without using a UDF? I tried doing something like

    my_df.withColumn("casted", my_df.value.getItem(IntegerType()))

to no avail.

Thanks!

information_interchange
    Possible duplicate of [How to extract an element from a array in pyspark](https://stackoverflow.com/questions/45254928/how-to-extract-an-element-from-a-array-in-pyspark) – pault Aug 21 '19 at 17:04

3 Answers


Select the element at position 0:

my_df.show()
+--------+
|   value|
+--------+
|[1, 100]|
|     [2]|
|  [3, 2]|
+--------+

my_df.withColumn('casted', my_df['value'][0]).show()
+--------+------+
|   value|casted|
+--------+------+
|[1, 100]|     1|
|     [2]|     2|
|  [3, 2]|     3|
+--------+------+
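For readers without a Spark session handy, here is a plain-Python sketch of the indexing semantics that bracket access uses: 0-based indices, and an out-of-range index yields null (`None`) rather than raising, at least in Spark's default (non-ANSI) mode. The `get_item` helper below is hypothetical, written only to mirror that behavior:

```python
def get_item(arr, index):
    """Hypothetical helper mirroring Column.getItem's 0-based lookup:
    missing or out-of-range entries come back as None, not an error."""
    if arr is None or index < 0 or index >= len(arr):
        return None
    return arr[index]

rows = [[1, 100], [2], [3, 2]]
print([get_item(r, 0) for r in rows])  # [1, 2, 3]
print([get_item(r, 1) for r in rows])  # [100, None, 2]
```

The null-on-missing behavior is what makes this safe on ragged arrays like `[2]` above.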
SMaZ

A different approach from the above:

    from pyspark.sql.functions import col
    from pyspark.sql.types import ArrayType, IntegerType

    my_x = [([1,100]), ([2]), ([3,2])]
    my_df = spark.createDataFrame(my_x, ArrayType(IntegerType()))

    my_df = my_df.withColumn("firstVal", col("value").getItem(0))

This should return a dataframe consisting of two columns:

    +--------+--------+
    |   value|firstVal|
    +--------+--------+
    |[1, 100]|       1|
    |     [2]|       2|
    |  [3, 2]|       3|
    +--------+--------+
shadow_dev

You can also use the `element_at` function (available in Spark 2.4+). Note that it uses 1-based indexing, so the first element is at index 1:

from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import functions as F
x = [([1,100]), ([2]), ([3,2])]
df = spark.createDataFrame(x, ArrayType(IntegerType()))
df = df.withColumn('extract', F.element_at(F.col('value'), 1))
df.show()

+--------+-------+
|   value|extract|
+--------+-------+
|[1, 100]|      1|
|     [2]|      2|
|  [3, 2]|      3|
+--------+-------+

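`element_at` differs from `getItem` in two ways: indices are 1-based, and negative indices count from the end of the array. Below is a plain-Python sketch of those rules; the helper is hypothetical, the null-on-out-of-range behavior assumes Spark's default (non-ANSI) mode, and index 0 (which Spark actually rejects with an error) is simplified here to return `None`:

```python
def element_at(arr, index):
    """Hypothetical helper mirroring F.element_at's 1-based lookup.
    Negative indices count from the end; out-of-range gives None.
    (Real Spark raises on index 0; we return None for simplicity.)"""
    if arr is None or index == 0 or abs(index) > len(arr):
        return None
    return arr[index - 1] if index > 0 else arr[index]

rows = [[1, 100], [2], [3, 2]]
print([element_at(r, 1) for r in rows])   # [1, 2, 3]
print([element_at(r, -1) for r in rows])  # [100, 2, 2]
```

The negative-index form is handy when you want the *last* element of arrays of varying length, which `getItem` cannot express directly.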
niuer