For some reason the following function in my pipeline is causing an error when I run a job on EMR (using emr-5.0.0 and Spark 2.0.0), even though the same script runs without error on my local machine:

def autv(self, f=atf):
    """Aggregate the per-row topic vectors in self._utv by id.

    Args:
        f: aggregation function applied to the grouped (t, w) pairs
            for each id; defaults to atf (shown below).

    Returns:
        self on success, or None if an AttributeError was logged.
    """
    # Assumes the usual module-level imports, e.g.:
    #   import logging
    #   from pyspark.sql import Row
    #   from pyspark.ml.linalg import DenseVector  # or pyspark.mllib.linalg
    if not self._utv:
        raise FileNotFoundError("Data not loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     .map(lambda x: (x.id, (x.t, x.w)))
                     .groupByKey()
                     .map(lambda x: Row(id=x[0],
                                        w=len(x[1]),
                                        t=DenseVector(f(x[1]))))
                     .toDF())
        return self
    except AttributeError as e:
        logging.error(e)
    return None

atf is a very simple function:

def atf(iterable):
    """Element-wise mean of the topic vectors in an iterable of (t, w) pairs.

    Args:
        iterable: (topic_vector, weight) pairs, as produced by groupByKey.

    Returns:
        A list with the per-dimension mean of the topic vectors (the weights
        are ignored).
    """
    # assumes: import statistics as stats
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]
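
To make what these two functions compute concrete, here is a minimal local sketch on toy data (the column names id/t/w, the local SparkSession, and the pyspark.ml.linalg import are just assumptions for illustration; the real DenseVector might come from pyspark.mllib.linalg instead):

import statistics as stats

from pyspark.ml.linalg import DenseVector
from pyspark.sql import Row, SparkSession

def atf(iterable):
    # element-wise mean of the topic vectors; the weights are ignored
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]

# atf on plain Python data
pairs = [([1.0, 3.0], 2.0), ([3.0, 5.0], 3.0)]
print(atf(pairs))  # [2.0, 4.0]

# the same groupByKey chain as autv, on a toy DataFrame
spark = SparkSession.builder.master("local[2]").appName("autv-sketch").getOrCreate()

df = spark.createDataFrame([Row(id=1, t=[1.0, 3.0], w=2.0),
                            Row(id=1, t=[3.0, 5.0], w=3.0),
                            Row(id=2, t=[2.0, 2.0], w=1.0)])

result = (df.rdd
          .map(lambda x: (x.id, (x.t, x.w)))
          .groupByKey()
          .map(lambda x: Row(id=x[0], w=len(x[1]), t=DenseVector(atf(x[1]))))
          .toDF())
result.show()
# id=1 -> t=[2.0, 4.0] (mean of its two vectors), w=2 (number of rows grouped)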

I get a long stream of errors, but here is the last part:

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command   
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'regression'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

16/08/27 16:28:43 INFO ShutdownHookManager: Shutdown hook called
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644/pyspark-41867521-9dfd-4d8f-8b13-33272063e0c3

There's an `ImportError: No module named 'regression'` message, which doesn't make sense to me: the rest of my script runs functions from this module without trouble, and when I remove the aggregate_user_topic_vectors function shown above (autv), the script runs without error. Also, as I said earlier, the script runs without error on my local machine even with this function. I've set PYTHONPATH to include my project, just to be sure. I really don't know where to go from here. Any comments would be appreciated.

Evan Zamir
  • You will need to supply Python dependencies through spark-submit; please see http://stackoverflow.com/questions/35214231/importerror-no-module-named-numpy-on-spark-workers – Ram Ghadiyaram Aug 27 '16 at 19:05 (a sketch of this approach appears after these comments)
  • It's not an external module. It's a module I wrote called "regression". I'm only using the Python standard library. – Evan Zamir Aug 27 '16 at 19:27
  • Okay, I see. Is it available to the worker nodes of the cluster? I understand that it works in local mode but doesn't resolve in cluster mode; that means your module is not available to the workers, isn't it? – Ram Ghadiyaram Aug 27 '16 at 19:44
  • @RamPrasadG I've been running everything else in this module with no problems. It's something about this specific function. I'm thinking of trying to make `groupByKey` into `reduceByKey`. I've read that these errors can be very misleading and I might just have a scaling problem. – Evan Zamir Aug 27 '16 at 20:11
  • Since it is your own module, it is almost definitely a path/dependency issue, as RamPrasad G suggested. Maybe paste how you organise files, import, and submit to Spark? – shuaiyuancn Aug 27 '16 at 20:31
  • As I already said a couple of times, the script runs fine with the other functions in the same module. – Evan Zamir Aug 27 '16 at 20:35
  • To run scripts we clone the repo onto the driver instance and simply invoke `spark-submit --master yarn scripts/script.py`. This has always worked for us. – Evan Zamir Aug 27 '16 at 20:53
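
For reference, here is a minimal sketch of the dependency-shipping approach suggested in the comments above (the file layout and paths are hypothetical; adjust them to wherever regression actually lives):

# Shipped at submit time (run from the repo root):
#   spark-submit --master yarn --py-files regression.py scripts/script.py
# If regression is a package rather than a single file, zip it and pass the zip to --py-files.

# Or, added from inside the driver script:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()
spark.sparkContext.addPyFile("regression.py")  # or a .zip of the package
# from here on, executor tasks can import regression when they unpickle closures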

1 Answer


Well, as I suspected, my problem was solved by moving from `groupByKey` (which is apparently evil) to `reduceByKey`, and hence had nothing to do with the way I was importing modules. Here is the revised code. Hope this helps someone!

def autv(self):
    """Aggregate topic vectors by user id, using reduceByKey instead of groupByKey."""
    if not self._utv:
        raise FileNotFoundError("No data loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     .map(lambda x: (x.id, (x.t, x.w)))
                     # combine the (t, w) pairs per id map-side, avoiding the full groupByKey shuffle
                     .reduceByKey(lambda accum, x: (accum[0] + x[0], accum[1] + x[1]))
                     .map(lambda row: Row(user_id=row[0],
                                          weight=row[1][1],
                                          topics=row[1][0]))
                     .toDF()).cache()
        return self
    except AttributeError as e:
        logging.error(e)
    return None
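
For anyone curious why the change matters: groupByKey ships every (t, w) pair for a key across the shuffle and holds them all on one executor, while reduceByKey combines values on each partition before the shuffle, so far less data moves and sits in memory. A toy illustration with made-up data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("reduce-vs-group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("a", 2), ("a", 3), ("b", 5)])

# groupByKey: all values for each key are collected into one iterable after the shuffle
print(sorted(pairs.groupByKey().mapValues(list).collect()))
# e.g. [('a', [1, 2, 3]), ('b', [5])]  (value order within a key is not guaranteed)

# reduceByKey: partial sums are computed map-side, then merged after the shuffle
print(sorted(pairs.reduceByKey(lambda x, y: x + y).collect()))
# [('a', 6), ('b', 5)]
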
Evan Zamir