I have a Spark cluster set up with 1 master node and 2 worker nodes. I am running a PySpark application on this Spark standalone cluster, where one job writes the transformed data into a MySQL database.

So, my question is whether writing to the database is done by the driver or by the executors. I ask because when writing to a text file, it seems to be done by the driver, since my output file gets created on the driver machine.

Update

Below is the code I have used to write to a text file:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(master="spark://IP:PORT", appName="word_count_application")
    words = sc.textFile("book_2.txt")
    # Classic word count: split lines into words, pair each with 1, sum by key.
    word_count = words.flatMap(lambda a: a.split(" ")) \
                      .map(lambda a: (a, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    word_count.saveAsTextFile("book2_output.txt")
Saranraj K

2 Answers


If the writing is done using the Dataset/DataFrame API like this:

df.write.csv("...")

Then it's done by the executors. That's why in Spark we get multiple files in the output: each executor writes the partitions assigned to it.
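For illustration, here is a minimal sketch of this behavior; the DataFrame `df`, the path, and the partition count are placeholders:

# A minimal sketch, assuming an existing DataFrame `df`.
df.repartition(4).write.csv("/tmp/output_dir")
# The target is a directory, and each partition becomes its own part file:
#   /tmp/output_dir/part-00000-....csv
#   /tmp/output_dir/part-00001-....csv
#   ...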

The driver is used for scheduling work across the executors, not for doing the actual work (reading, transforming, and writing), which is done by the executors.
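The same applies to a database write. As a rough sketch of a JDBC write to MySQL (the URL, table name, and credentials below are placeholders, and the MySQL JDBC driver jar is assumed to be on the cluster classpath):

# Sketch of a distributed write to MySQL; connection details are placeholders.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://HOST:3306/mydb") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "word_counts") \
    .option("user", "USER") \
    .option("password", "PASSWORD") \
    .mode("append") \
    .save()
# Each executor opens its own JDBC connection and inserts the rows of the
# partitions it holds; the driver only coordinates the tasks.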

Abdennacer Lachiheb
  • @abdenncar thank you for your reply. I have tried writing output to a text file, and my output is present only on the driver machine. In that case, how does the executor write output to the machine where the driver is running? – Saranraj K Dec 13 '22 at 10:28
  • @SaranrajK the output can be anywhere: the node hosting the driver, an external file system, blob storage, a data lake, etc. It depends on your configuration. Your question is who does the writing, which is always the executors, not where the writing is done. Can you share how you are doing your writing? Maybe we can understand why it's writing to the driver. – Abdennacer Lachiheb Dec 13 '22 at 11:39
  • @abdenncar Please find the edited question; I have updated it with the code I used for writing. – Saranraj K Dec 13 '22 at 13:18

saveAsTextFile() is distributed: each executor writes its own files. Your driver will never write any files since, as @Abdennacer Lachiheb already mentioned, it is responsible for scheduling, the Spark UI, and more.

Your path refers to a local file system, so your files are not being saved by your driver, but rather on the machine your driver runs on. The path could also point to an object store like S3 or a distributed file system like HDFS.
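For example, a minimal sketch of pointing the same write at shared storage; the URIs below are placeholders for your environment:

# Hypothetical URIs; substitute your own HDFS namenode or S3 bucket.
word_count.saveAsTextFile("hdfs://namenode:8020/output/book2_output")
# or, with the hadoop-aws package configured:
word_count.saveAsTextFile("s3a://my-bucket/output/book2_output")

With shared storage, every executor writes its part files to the same location, so the output no longer depends on which machine the driver happens to run on.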

Robert Kossendey