
I'm trying to implement a simple Apache Spark RDD pipeline, but it seems the work I hand to my session is never executed.

I started the cluster by running ./start-all.sh in /usr/local/spark/sbin.

then I created a new session by doing this:

import shutil
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Oncofinder -- Preprocessing")
         .getOrCreate())

dirname = "oncofinder"
zipname = dirname + ".zip"
shutil.make_archive(dirname, 'zip', dirname + "/..", dirname)
spark.sparkContext.addPyFile(zipname)

which ships a fresh copy of my app package to the Spark workers.

I'm using the Python library pyspark.

Then I pass my Spark session to a function called preprocess:

train_rdd = preprocess(spark, [1, 2], tile_size=tile_size, sample_size=sample_size,
                       grayscale=grayscale, num_partitions=num_partitions, folder=folder)

and my function:

def preprocess(spark, slide_nums, folder="data", training=True, tile_size=1024, overlap=0,
               tissue_threshold=0.9, sample_size=256, grayscale=False, normalize_stains=True,
               num_partitions=20000):

    print("===PREPROCESSING===")

    slides = (spark.sparkContext
              .parallelize(slide_nums)
              .filter(lambda slide: open_slide(slide, folder, training) is not None))

and when I run this piece of code, I get:

2018-11-27 00:36:30 WARN  Utils:66 - Your hostname, luiscosta-GT62VR-6RD resolves to a loopback address: 127.0.1.1; using 192.168.1.67 instead (on interface wlp2s0)
2018-11-27 00:36:30 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/luiscosta/PycharmProjects/wsi_preprocessing/oncofinder/lib/python3.6/site-packages/pyspark/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-11-27 00:36:30 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
===PREPROCESSING===

It reaches my ===PREPROCESSING=== checkpoint but it does not run my open_slide function.

I'm kind of new to Apache Spark, and I apologize if this is a silly question, but when I read the docs this looked really straightforward.

Kind Regards

    That's normal behavior. I would strongly recommend reading how Spark works, in particular about difference between [transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) and [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions). `filter` is the former one, hence it is lazy and won't be scheduled, unless there is a subsequent action that requires its output. – zero323 Nov 27 '18 at 11:42
  • Possible duplicate of [How can I force Spark to execute code?](https://stackoverflow.com/q/31383904/6910411) – zero323 Nov 27 '18 at 11:44
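The laziness described in the comment above can be mimicked in plain Python with a generator expression -- a minimal sketch, no Spark required, using a hypothetical stand-in for `open_slide` that records when it actually runs:

```python
# Plain-Python sketch of Spark's lazy evaluation: the filter predicate
# (like the open_slide check) runs only when a terminal operation
# ("action") consumes the pipeline.
calls = []

def open_slide(slide):
    calls.append(slide)          # record that the predicate actually ran
    return slide % 2 == 0        # hypothetical stand-in for the real check

slide_nums = [1, 2, 3, 4]

# "Transformation": building the generator executes nothing yet.
slides = (s for s in slide_nums if open_slide(s))
assert calls == []               # open_slide has not been called

# "Action": materializing the generator finally runs the predicate.
result = list(slides)
assert calls == [1, 2, 3, 4]
assert result == [2, 4]
```

In Spark terms, `filter` corresponds to building the generator, while an action such as `slides.count()` or `slides.collect()` corresponds to `list(slides)` and is what finally schedules `open_slide` on the workers.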

0 Answers