pyspark.sql understanding syntax and difference between functions explode and split

Question

I am taking a MOOC.

It has a shakespeareDF DataFrame that contains the text below:

+-------------------------------------------------+
|word                                             |
+-------------------------------------------------+
|1609                                             |
|                                                 |
|the sonnets                                      |
|                                                 |
|by william shakespeare                           |
|                                                 |
|                                                 |
|                                                 |
|1                                                |
|from fairest creatures we desire increase        |
|that thereby beautys rose might never die        |
|but as the riper should by time decease          |
|his tender heir might bear his memory            |
|but thou contracted to thine own bright eyes     |
|feedst thy lights flame with selfsubstantial fuel|
+-------------------------------------------------+

They run the following code on it:

from pyspark.sql.functions import split, explode
shakeWordsDF = shakespeareDF.select(explode(split(shakespeareDF[0], r"\s+")))

I would like to understand:

1. What is the difference between explode and split, and why do we have to use both? I looked at the online documentation and couldn't understand it.
2. Why do we have to use shakespeareDF[0] and not just shakespeareDF?

score 0 · Answer 1 · edited May 23 '17 at 12:08

Q.1 look here

Q.2 shakespeareDF[0] selects the first column of the DataFrame.
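
On Q.1, a plain-Python analogy (no Spark required) may make the difference easier to see. This is only a sketch: `split_column` and `explode_column` are hypothetical helper names, not PySpark APIs, and the sample rows are taken from the question. split turns each string into an array of strings and keeps one row per input row; explode then gives each array element its own row. Both are needed because explode works on an array column, and split is what produces one.

```python
import re

# Sample rows from the question's single "word" column (one blank line included).
lines = ["the sonnets", "", "by william shakespeare"]


def split_column(rows, pattern):
    """Analogy for split(): each string becomes a list; the row count is unchanged."""
    return [re.split(pattern, row) for row in rows]


def explode_column(rows_of_lists):
    """Analogy for explode(): each list element becomes its own row."""
    return [item for row in rows_of_lists for item in row]


# After "split": still three rows, but each one now holds a list of words.
arrays = split_column(lines, r"\s+")

# After "explode": one row per word, so the row count grows.
words = explode_column(arrays)

print(arrays)
print(words)
```

In this analogy the blank line survives as an empty string, so a filter on empty words may still be needed afterwards.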

answered Jul 24 '16 at 02:30 by Tinto James · edited May 23 '17 at 12:08 by Community

If shakespeareDF had only one column, would we still need to use shakespeareDF[0]? – user2543622 Jul 24 '16 at 17:44
shakespeareDF is a DataFrame, and what you need is a Column. You can select that Column explicitly by name, shakespeareDF['Column_Name'], or by column index, which is 0 in this case because it is the first (and only) column. – Tinto James Jul 25 '16 at 03:24
