I'm having a problem connecting to a Spark cluster remotely from a Jupyter notebook. It works fine locally.
Method 1:
import pyspark

conf = pyspark.SparkConf().setAppName('Pi').setMaster('spark://my-cluster:7077')
sc = pyspark.SparkContext(conf=conf)
This returns successfully. When I then try to run the Pi example:
from random import random
from operator import add

partitions = 3
n = 1000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
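For reference, the same Monte Carlo computation runs fine locally without Spark (plain Python, no cluster needed), so the logic itself seems sound:

```python
from random import random
from operator import add
from functools import reduce

partitions = 3
n = 1000 * partitions

def f(_):
    # Sample a point in the square [-1, 1] x [-1, 1]; count it
    # if it falls inside the unit circle.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# Same map/reduce as the Spark version, just on one machine.
count = reduce(add, map(f, range(1, n + 1)))
pi_estimate = 4.0 * count / n
print("Pi is roughly %f" % pi_estimate)
```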
it just keeps running forever, with this warning repeating in the console:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
In the cluster UI the job is listed as "running" and the workers are alive; however, it seems no work is actually being done:
Worker Id Address State Cores Memory
worker-20180417174137-192.168.1.13-43697 192.168.1.13:43697 ALIVE 4 (4 Used) 6.8 GB (1024.0 MB Used)
worker-20180417174137-192.168.1.14-38778 192.168.1.14:38778 ALIVE 4 (4 Used) 6.8 GB (1024.0 MB Used)
worker-20180417174137-192.168.1.15-35776 192.168.1.15:35776 ALIVE 4 (4 Used) 6.8 GB (1024.0 MB Used)
So there should be more than enough resources for this simple job. What could be causing the issue here?
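One thing I wondered about (this is an assumption on my part, not a confirmed cause) is whether the workers can reach back to the driver running inside the notebook. If that were the problem, pinning the driver endpoint explicitly might help. `spark.driver.host` and `spark.driver.port` are standard Spark properties, but the address and port below are placeholders for the notebook machine, not values from my setup:

```python
import pyspark

# Sketch: pin the driver to an address/port the workers can reach,
# e.g. so a firewall rule can be opened for it.
# '192.168.1.10' and '5001' are placeholders.
conf = (pyspark.SparkConf()
        .setAppName('Pi')
        .setMaster('spark://my-cluster:7077')
        .set('spark.driver.host', '192.168.1.10')
        .set('spark.driver.port', '5001'))
```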
Prompted by remarks in the comments on this question about the exact same issue, I tried connecting to the cluster with YARN as the master.
conf.set('spark.hadoop.yarn.resourcemanager.address', 'my-cluster:8032')
conf.set('spark.hadoop.fs.default.name', 'hdfs://my-cluster:9000')
conf.set('spark.submit.deployMode', 'client')
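Put together, the YARN attempt looks like this (a sketch of my configuration; running with 'yarn' as master also assumes the cluster's Hadoop configuration is visible to the client via HADOOP_CONF_DIR or YARN_CONF_DIR, which may be part of my problem):

```python
import pyspark

# Sketch: the YARN variant of the configuration above.
conf = (pyspark.SparkConf()
        .setAppName('Pi')
        .setMaster('yarn')
        .set('spark.hadoop.yarn.resourcemanager.address', 'my-cluster:8032')
        .set('spark.hadoop.fs.default.name', 'hdfs://my-cluster:9000')
        .set('spark.submit.deployMode', 'client'))
```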
This resulted in this issue, and I applied the answer accordingly. Initiating the context then with
sc = pyspark.SparkContext(conf=conf)
simply does not return; it blocks endlessly until manually cancelled. I tried this in my Ubuntu VM, and then, thinking it might be a network issue with the VM, installed Spark on the host Windows 7 system as well, with exactly the same result. So how do I successfully connect to the cluster and launch applications?