I am taking a MOOC.

It has a DataFrame, shakespeareDF, that contains the text below:

+-------------------------------------------------+
|word                                             |
+-------------------------------------------------+
|1609                                             |
|                                                 |
|the sonnets                                      |
|                                                 |
|by william shakespeare                           |
|                                                 |
|                                                 |
|                                                 |
|1                                                |
|from fairest creatures we desire increase        |
|that thereby beautys rose might never die        |
|but as the riper should by time decease          |
|his tender heir might bear his memory            |
|but thou contracted to thine own bright eyes     |
|feedst thy lights flame with selfsubstantial fuel|
+-------------------------------------------------+

They run the following code on it:

from pyspark.sql.functions import split, explode
shakeWordsDF = shakespeareDF.select(explode(split(shakespeareDF[0], r"\s+")))

I would like to understand:

  1. What is the difference between explode and split, and why do we have to use both? I looked at the online documentation but couldn't understand it.
  2. Why do we have to use shakespeareDF[0] and not just shakespeareDF?
user2543622

1 Answer

Q.1 look here

Q.2 shakespeareDF[0] selects the first column of the DataFrame. select and split operate on a column, not on the whole DataFrame.
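
For Q.1, a rough plain-Python analogy may help. This is not Spark code: the helper names split_row and explode_rows are hypothetical stand-ins for what split and explode do to a column, and the list named rows stands in for the one-column DataFrame. The idea it sketches is that split keeps one row per line but turns the string into a list of words, while explode turns each element of those lists into its own row.

```python
import re

# Plain-Python stand-in for a one-column DataFrame: each string is one row.
rows = [
    "the sonnets",
    "by william shakespeare",
    "",
]


def split_row(line, pattern=r"\s+"):
    """Hypothetical analog of split(): one string in, one list of tokens out."""
    return re.split(pattern, line)


def explode_rows(list_rows):
    """Hypothetical analog of explode(): one output row per list element."""
    return [token for tokens in list_rows for token in tokens]


# After "split": still three rows, but each row now holds a list of words.
split_result = [split_row(line) for line in rows]

# After "explode": the lists are flattened, so every word is its own row.
exploded = explode_rows(split_result)

print(split_result)
print(exploded)
```

In this analogy the blank line survives as a single empty token, so a later filtering step would be needed if empty rows are not wanted. You need both functions because split alone leaves you with one list per line, and explode alone has no list to expand.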

Tinto James
  • If shakespeareDF had only one column, would we still need to use shakespeareDF[0]? – user2543622 Jul 24 '16 at 17:44
  • shakespeareDF is a DataFrame, and what you need is a Column. You can select that column explicitly by name, shakespeareDF['Column_Name'], or by column index, which is 0 in this case because it is the first (and only) column. – Tinto James Jul 25 '16 at 03:24