
I am trying to use the dask-spark project proposed by Matthew Rocklin.

When adding dask-spark to my project, I run into a problem: it hangs at "Waiting for workers", as shown in the following figure.

Here, I run two Dask worker nodes (started with dask-worker tcp://ubuntu8:8786 and tcp://ubuntu9:8786) and two Spark worker nodes in standalone mode (worker-20180918112328-ubuntu8-45764 and worker-20180918112413-ubuntu9-41972).

[Figure: console output stuck at "Waiting for workers"]

My Python code is as follows:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
from dask.distributed import Client
import distributed.joblib  # registers the 'dask.distributed' joblib backend
from sklearn.externals.joblib import parallel_backend
from dask_spark import spark_to_dask
from pyspark import SparkConf, SparkContext
from dask_spark import dask_to_spark

if __name__ == '__main__':

  sc = SparkContext()
  # connect to the cluster via dask-spark
  client = spark_to_dask(sc)
  digits = load_digits()
  X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    train_size=0.75,
    test_size=0.25,
  )

  tpot = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
  )
  with joblib.parallel_backend('dask.distributed', scheduler_host='ubuntu8:8786'):
    tpot.fit(X_train, y_train)

  print(tpot.score(X_test, y_test))

I would highly appreciate it if you could help me solve this problem.

2 Answers


I have revised the spark_to_dask function in core.py as follows:

def spark_to_dask(sc, loop=None):
    """ Launch a Dask cluster from a Spark Context
    """
    # n_workers=None / threads_per_worker=None let LocalCluster start its default
    # set of local worker processes, so the scheduler no longer blocks at
    # "Waiting for workers"
    cluster = LocalCluster(n_workers=None, loop=loop, threads_per_worker=None)
    rdd = sc.parallelize(range(1000))
    address = cluster.scheduler.address
    # ... (rest of the function left as in the original core.py)

After this change, my test case ran successfully on Spark in both standalone and Mesos modes.


As noted in the README of the project, dask-spark is not mature. It was a weekend project and I do not recommend its use.

Instead, I recommend launching Dask directly using one of the mechanisms described here: http://dask.pydata.org/en/latest/setup.html
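For instance, here is a minimal sketch of running the same TPOT workload against a Dask cluster launched directly, reusing the scheduler address tcp://ubuntu8:8786 and the 'dask.distributed' joblib backend from your own code; those hostnames and backend settings are assumptions carried over from your setup, not requirements:

# On ubuntu8 (shell):             dask-scheduler              # listens on port 8786 by default
# On ubuntu8 and ubuntu9 (shell): dask-worker tcp://ubuntu8:8786

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.externals import joblib
import distributed.joblib  # registers the 'dask.distributed' joblib backend
from dask.distributed import Client

# connect directly to the already-running scheduler instead of going through Spark
client = Client('tcp://ubuntu8:8786')

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=2, population_size=10, cv=2,
                      n_jobs=-1, random_state=0, verbosity=0)

with joblib.parallel_backend('dask.distributed', scheduler_host='ubuntu8:8786'):
    tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))

In this setup everything inside the with block runs on the Dask workers; Spark is not involved at all.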

If you have to use Mesos then I'm not sure I'll be of much help, but there is a package daskathon that runs on top of Marathon that may interest you.

– MRocklin
  • Here, I have a follow-up question: the spark_to_dask API creates a Dask cluster from a Spark cluster. Is the TPOT computation on that Dask cluster actually performed on the Spark cluster? For my project I would like to execute the TPOT algorithm over a Spark cluster, so that the computation is carried out by Spark RDD operations. I am not sure whether spark_to_dask from dask-spark is the right way to achieve this, or whether I should use dask_to_spark instead. –  Sep 25 '18 at 15:51