
I have a dataframe that contains the following:

movieId / movieName / genre
1         example1    action|thriller|romance
2         example2    fantastic|action

I would like to obtain a second dataframe (from the first one), that contains the following:

movieId / movieName / genre
1         example1    action
1         example1    thriller
1         example1    romance
2         example2    fantastic
2         example2    action

How can we do it using pyspark?

Codegator
1 Answer


Use the split function to turn the pipe-delimited string into an array, then the explode function on that array to produce one row per element.

Example:

df.show(10,False)
#+-------+---------+-----------------------+
#|movieid|moviename|genre                  |
#+-------+---------+-----------------------+
#|1      |example1 |action|thriller|romance|
#+-------+---------+-----------------------+

from pyspark.sql.functions import *

# rename genre, split the pipe-delimited string into an array (the | must be
# escaped because split takes a regex), explode into one row per genre,
# then drop the helper column
df.withColumnRenamed("genre","genre1").\
withColumn("genre",explode(split(col("genre1"),'\\|'))).\
drop("genre1").\
show()
#+-------+---------+--------+
#|movieid|moviename|   genre|
#+-------+---------+--------+
#|      1| example1|  action|
#|      1| example1|thriller|
#|      1| example1| romance|
#+-------+---------+--------+
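
As a minimal sketch (assuming an active SparkSession named spark), the same split + explode pattern applied to the full example from the question, overwriting the genre column in place without the intermediate rename:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode

spark = SparkSession.builder.getOrCreate()

# recreate the example dataframe from the question
df = spark.createDataFrame(
    [(1, "example1", "action|thriller|romance"),
     (2, "example2", "fantastic|action")],
    ["movieId", "movieName", "genre"])

# split on the literal pipe, then explode so each genre gets its own row
df.withColumn("genre", explode(split(col("genre"), '\\|'))).show()
#+-------+---------+---------+
#|movieId|movieName|    genre|
#+-------+---------+---------+
#|      1| example1|   action|
#|      1| example1| thriller|
#|      1| example1|  romance|
#|      2| example2|fantastic|
#|      2| example2|   action|
#+-------+---------+---------+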
notNull
    Thanks. This works too: df.withColumn("genre",explode(split(col("genre"),'\\|'))).show(). Any reason you added the genre1 column and then dropped it? – Codegator Aug 18 '20 at 02:07