For some reason the following function in my pipeline is causing an error when I run a job on EMR (using emr-5.0.0 and Spark 2.0.0), even though the same script runs without error on my local machine:

def autv(self, f=atf):
    """Aggregate the per-row topic vectors in self._utv by id.

    Args:
        f: aggregation function applied to the grouped (t, w) pairs
            for each id; defaults to atf (shown below).

    Returns:
        self on success, or None if an AttributeError was logged.
    """
    # Assumes the usual module-level imports, e.g.:
    #   import logging
    #   from pyspark.sql import Row
    #   from pyspark.ml.linalg import DenseVector  # or pyspark.mllib.linalg
    if not self._utv:
        raise FileNotFoundError("Data not loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     .map(lambda x: (x.id, (x.t, x.w)))
                     .groupByKey()
                     .map(lambda x: Row(id=x[0],
                                        w=len(x[1]),
                                        t=DenseVector(f(x[1]))))
                     .toDF())
        return self
    except AttributeError as e:
        logging.error(e)
    return None

atf is a very simple function:

def atf(iterable):
    """Element-wise mean of the topic vectors in an iterable of (t, w) pairs.

    Args:
        iterable: (topic_vector, weight) pairs, as produced by groupByKey.

    Returns:
        A list with the per-dimension mean of the topic vectors (the weights
        are ignored).
    """
    # assumes: import statistics as stats
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]
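
To make what these two functions compute concrete, here is a minimal local sketch on toy data (the column names id/t/w, the local SparkSession, and the pyspark.ml.linalg import are just assumptions for illustration; the real DenseVector might come from pyspark.mllib.linalg instead):

import statistics as stats

from pyspark.ml.linalg import DenseVector
from pyspark.sql import Row, SparkSession

def atf(iterable):
    # element-wise mean of the topic vectors; the weights are ignored
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]

# atf on plain Python data
pairs = [([1.0, 3.0], 2.0), ([3.0, 5.0], 3.0)]
print(atf(pairs))  # [2.0, 4.0]

# the same groupByKey chain as autv, on a toy DataFrame
spark = SparkSession.builder.master("local[2]").appName("autv-sketch").getOrCreate()

df = spark.createDataFrame([Row(id=1, t=[1.0, 3.0], w=2.0),
                            Row(id=1, t=[3.0, 5.0], w=3.0),
                            Row(id=2, t=[2.0, 2.0], w=1.0)])

result = (df.rdd
          .map(lambda x: (x.id, (x.t, x.w)))
          .groupByKey()
          .map(lambda x: Row(id=x[0], w=len(x[1]), t=DenseVector(atf(x[1]))))
          .toDF())
result.show()
# id=1 -> t=[2.0, 4.0] (mean of its two vectors), w=2 (number of rows grouped)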

I get a long stream of errors, but here is the last part:

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command   
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'regression'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

16/08/27 16:28:43 INFO ShutdownHookManager: Shutdown hook called
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644/pyspark-41867521-9dfd-4d8f-8b13-33272063e0c3

There's an `ImportError: No module named 'regression'` message, which doesn't make sense to me: the rest of my script runs functions from this module without trouble, and when I remove the aggregate_user_topic_vectors function shown above (autv), the script runs without error. Also, as I said earlier, the script runs without error on my local machine even with this function. I've set PYTHONPATH to include my project, just to be sure. I really don't know where to go from here. Any comments would be appreciated.

Evan Zamir
  • You will need to supply Python dependencies through spark-submit; please see http://stackoverflow.com/questions/35214231/importerror-no-module-named-numpy-on-spark-workers – Ram Ghadiyaram Aug 27 '16 at 19:05 (a sketch of this approach appears after these comments)
  • It's not an external module. It's a module I wrote called "regression". I'm only using the Python standard library. – Evan Zamir Aug 27 '16 at 19:27
  • Okay, I see. Is it available to the worker nodes of the cluster? I understand that it works in local mode but doesn't resolve in cluster mode; that means your module is not available to the workers, isn't it? – Ram Ghadiyaram Aug 27 '16 at 19:44
  • @RamPrasadG I've been running everything else in this module with no problems. It's something about this specific function. I'm thinking of trying to make `groupByKey` into `reduceByKey`. I've read that these errors can be very misleading and I might just have a scaling problem. – Evan Zamir Aug 27 '16 at 20:11
  • Since it is your own module, it is almost definitely a path/dependency issue, as RamPrasad G suggested. Maybe paste how you organise files, import, and submit to Spark? – shuaiyuancn Aug 27 '16 at 20:31
  • As I already said a couple of times, the script runs fine with the other functions in the same module. – Evan Zamir Aug 27 '16 at 20:35
  • To run scripts we clone the repo onto the driver instance and simply invoke `spark-submit --master yarn scripts/script.py`. This has always worked for us. – Evan Zamir Aug 27 '16 at 20:53
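
For reference, here is a minimal sketch of the dependency-shipping approach suggested in the comments above (the file layout and paths are hypothetical; adjust them to wherever regression actually lives):

# Shipped at submit time (run from the repo root):
#   spark-submit --master yarn --py-files regression.py scripts/script.py
# If regression is a package rather than a single file, zip it and pass the zip to --py-files.

# Or, added from inside the driver script:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()
spark.sparkContext.addPyFile("regression.py")  # or a .zip of the package
# from here on, executor tasks can import regression when they unpickle closures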

1 Answer


Well, as I suspected, my problem was solved by moving from `groupByKey` (which is apparently evil) to `reduceByKey`, and hence had nothing to do with the way I was importing modules. Here is the revised code. Hope this helps someone!

def autv(self):
    """Aggregate topic vectors by user id, using reduceByKey instead of groupByKey."""
    if not self._utv:
        raise FileNotFoundError("No data loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     .map(lambda x: (x.id, (x.t, x.w)))
                     # combine the (t, w) pairs per id map-side, avoiding the full groupByKey shuffle
                     .reduceByKey(lambda accum, x: (accum[0] + x[0], accum[1] + x[1]))
                     .map(lambda row: Row(user_id=row[0],
                                          weight=row[1][1],
                                          topics=row[1][0]))
                     .toDF()).cache()
        return self
    except AttributeError as e:
        logging.error(e)
    return None
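
For anyone curious why the change matters: groupByKey ships every (t, w) pair for a key across the shuffle and holds them all on one executor, while reduceByKey combines values on each partition before the shuffle, so far less data moves and sits in memory. A toy illustration with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 2), ("a", 3), ("b", 5)])

# groupByKey: all values for each key are collected into one iterable after the shuffle
print(sorted(pairs.groupByKey().mapValues(list).collect()))
# e.g. [('a', [1, 2, 3]), ('b', [5])]  (value order within a key is not guaranteed)

# reduceByKey: partial sums are computed map-side, then merged after the shuffle
print(sorted(pairs.reduceByKey(lambda x, y: x + y).collect()))
# [('a', 6), ('b', 5)]
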
Evan Zamir