
I have a data frame of the following form:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want my output to be of the following form:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to split it into separate columns. What needs to be done?

I saw many answers using flatMap, but they increase the number of rows. I just want the tuple's elements put into additional columns in the same row.

The following is my actual schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone please help me with an explanation of both DataFrames and RDDs?

AnmolDave
  • What's the schema of your data frame? Can you show df.printSchema()? – Psidom Jul 22 '17 at 13:26
  • Hi, I edited the question with my actual schema – AnmolDave Jul 22 '17 at 13:41
  • Do all cells in the array column have the same number of elements? Always 2? What if another row has three elements in the array? – Psidom Jul 22 '17 at 13:47
  • No, all the arrays have exactly 2 elements, because the elements are a start date and an end date. – AnmolDave Jul 22 '17 at 13:56
  • Also, this is my actual requirement, if you can help me with it: https://stackoverflow.com/questions/45252906/combine-two-rows-in-spark-based-on-a-condition-in-pyspark?noredirect=1#comment77471048_45252906 – AnmolDave Jul 22 '17 at 14:00

2 Answers


Create sample data:

from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = sc.parallelize(x)  # or simply spark.createDataFrame(x), skipping the RDD step
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract each element from the array column; in your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#|  xx|  yy|  zz| 123| 234|
#+----+----+----+----+----+
Psidom
  • @Lydia please be *extremely careful* and sure you know what you are doing when altering code: your edit had ruined a perfectly good answer, leading it to throw an exception (restored it to OP's original)... – desertnaut Oct 25 '18 at 10:17
  • Do you have a way to generalize the iteration over the original col4's array? – Amesys Dec 02 '19 at 12:59
  • @Amesys Did you try destructuring a list comprehension? – Union find Apr 03 '20 at 17:50
  • I have a follow-up question, dropping the link, thanks in advance! https://stackoverflow.com/questions/61823544/pyspark-mapping-multiple-columns @Psidom – jgtrz May 15 '20 at 20:20
  • What is an efficient way to apply this to more than 10 columns, where each column has only one item in its list? – haneulkim Jan 25 '22 at 15:18

You have 4 options to extract the value inside the array:

df = spark.createDataFrame([[1, [10, 20, 30, 40]]], ['A', 'B'])
df.show()

+---+----------------+
|  A|               B|
+---+----------------+
|  1|[10, 20, 30, 40]|
+---+----------------+

from pyspark.sql import functions as F

df.select(
    "A",
    df.B[0].alias("B0"), # dot notation and index        
    F.col("B")[1].alias("B1"), # function col and index
    df.B.getItem(2).alias("B2"), # dot notation and method getItem
    F.col("B").getItem(3).alias("B3"), # function col and method getItem
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+

In case you have many columns, use a list comprehension:

df.select(
    'A', *[F.col('B')[i].alias(f'B{i}') for i in range(4)]
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+
Mykola Zotko