Can anyone please help with how to:
- submit a PySpark job from Google Cloud Shell
- pass files and arguments in the PySpark submit command
- read those files and arguments in the PySpark code
1. To submit a job to a Dataproc cluster, run the gcloud CLI gcloud dataproc jobs submit command locally in a terminal window or in Cloud Shell. Here is the detailed official documentation.
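For example, a Dataproc PySpark job with an extra file and positional arguments could be submitted roughly like this (the cluster, region, bucket, and file names below are placeholders, not values from your setup):

gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --files=gs://my-bucket/config.json \
    -- gs://my-bucket/input gs://my-bucket/output

Everything after the -- separator is passed to job.py as ordinary program arguments, and --files stages config.json alongside the job on the cluster.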
2. With the help of spark-submit you can pass program arguments. spark-submit has the following syntax:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
You can use either application-arguments or conf to pass the required configuration to the main method and to SparkConf, respectively. Here is the documentation.
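As a rough sketch of the PySpark side (the argument names and the conf key are made up for illustration): positional arguments show up in sys.argv, and values passed with --conf (or --properties on Dataproc) can be read back from the Spark configuration.

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("args-demo").getOrCreate()

# Positional application arguments arrive in sys.argv; sys.argv[0] is the script itself.
input_path = sys.argv[1]
output_path = sys.argv[2]

# A custom key passed via --conf, e.g. spark.myapp.threshold=10,
# can be read back from the runtime config (second value is a default).
threshold = spark.conf.get("spark.myapp.threshold", "10")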
3. Some information on reading files with Spark:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz"). The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. Here is a blog explaining it.
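Putting this together, here is an illustrative snippet (the file and path names are assumptions): a file shipped with --files can be located via SparkFiles, while data files are read with textFile, which accepts wildcards and an optional minimum partition count.

import json
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-files-demo").getOrCreate()
sc = spark.sparkContext

# config.json was distributed with --files; SparkFiles.get resolves its local staged path.
with open(SparkFiles.get("config.json")) as f:
    config = json.load(f)

# Wildcards and directories work with textFile; the second argument asks for
# at least 8 partitions (you cannot get fewer partitions than blocks).
rdd = sc.textFile("gs://my-bucket/my/directory/*.txt", 8)
print(rdd.getNumPartitions())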
Here is the guide covering the whole process.