
I'm trying to submit the following Python script to a Spark cluster. I have 2 slaves running.

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from pyspark import SparkContext
# Use spark_sklearn's grid search instead:
from spark_sklearn.grid_search import GridSearchCV

sc = SparkContext(appName="trainspark")
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)

I'm using the following command from the shell to submit the application:

./bin/spark-submit --master spark://122.138.1.66:7077 '/script/trainspark.py'

However, I don't see it in the "Running Applications" section of the master GUI. Am I missing anything?


1 Answer


For submitting a Python script to Spark, there are three types of cluster deployment available:

  1. Apache Spark standalone cluster
  2. YARN
  3. Mesos

For standalone mode:

  1. If you use --deploy-mode cluster with spark-submit, the Python script will run, but the application will not appear in the UI and will not actually run in cluster mode.
  2. If you use --deploy-mode client with spark-submit, the Python script will run on the cluster and the application will be displayed in the UI. For this, set the Spark master URL to the master node's IP, as in (spark://x.x.x.x:7077), and provide an application name in the conf, which is what the UI displays. You only need to run the Python script on the master node; there is no need to copy it to the slave nodes.
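Putting these points together, a client-mode submission might look like the following sketch (the master IP is the one from the question; trainspark is an assumed application name):

```shell
# Run from the master node; --deploy-mode client makes the app
# appear under "Running Applications" in the master UI.
./bin/spark-submit \
  --master spark://122.138.1.66:7077 \
  --deploy-mode client \
  --name trainspark \
  /script/trainspark.py
```

Equivalently, the application name can be set inside the script itself (e.g. via SparkContext(appName=...)) instead of on the command line.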