
I'm working through some simple transformations using PySpark and keep running into a 'bool' object is not callable error. The Spark version is 1.3.0.

I've come across this being asked about in a few other places (e.g. here and here), but the suggestion seems to be simply to verify that the major Python versions are aligned between the driver and workers, which I have done (each is an Anaconda distribution with Python 2.7.10).
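
For reference, this is roughly how I have been comparing interpreters (just a sketch, assuming the usual `sc` from the PySpark shell): the driver prints its own sys.version, and each task reports the version the executors actually run under.

import sys

# Driver-side interpreter
print(sys.version)

# Worker-side interpreters: each task reports the version it runs under;
# more than one distinct entry would mean the driver and workers disagree.
versions = (sc.parallelize(range(sc.defaultParallelism * 2))
              .map(lambda _: sys.version)
              .distinct()
              .collect())
print(versions)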

For debugging this, I've been using the iris dataset stored in HDFS:

data = sc.textFile("/path/to/iris.csv")
data.count()  # works fine, returns 150
data.map(lambda x: x[:2])  # just subsets the string, works fine
data.map(lambda x: x.split(','))  # throws error below

These (obviously) only fail once .collect(), .take(), or .count() is called and the map is actually evaluated. So I am basically looking for any further ideas or things to try to get this configured properly.

15/09/28 17:55:08 INFO YarnScheduler: Removed TaskSet 14.0, whose tasks have all completed, from pool: 
An error occurred while calling o135.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 4 times, most recent failure: Lost task 1.3 in stage 14.0: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 101, in main
process()
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/jars/spark-assembly-1.3.0-cdh5.4.5-hadoop2.6.0-cdh5.4.5.jar/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 2253, in pipeline_func
return func(split, prev_func(split, iterator))
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 270, in func
return f(iterator)
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/spark/python/pyspark/rdd.py", line 933, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "<stdin>", line 1, in <lambda>
**TypeError: 'bool' object is not callable**

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
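
One further thing I've been meaning to try (just a sketch, not a fix): replace the lambda with a named function that checks what the workers actually receive, so the traceback points at a real function instead of "<stdin>", line 1, in <lambda>.

def split_line(line):
    # If .split has somehow been shadowed on the worker (e.g. by a bool),
    # fail with an explicit message instead of the bare TypeError.
    if not callable(line.split):
        raise TypeError("line.split is %r on the worker" % (line.split,))
    return line.split(',')

data.map(split_line).take(5)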
  • How did you check the Python versions? Usually PySpark will use the system version of python (/usr/bin/python) unless you explicitly specify the PYSPARK_PYTHON environment variable. – santon Sep 30 '15 at 03:06
  • Hmm, basically by checking the python on the path (`which python`). I also experimented with setting PYSPARK_PYTHON in the pyspark-env.sh script. – devmacrile Sep 30 '15 at 20:59
  • But where are you executing `which python`? If you execute it in the main process (the one where you're also setting up `SparkConf` and `SparkContext`), it will return the path you want. But unless `PYSPARK_PYTHON` is set, the workers will likely use the default system python. For my own sanity, I always make sure to set `PYSPARK_PYTHON` environment variable in the same script that I configure everything else. Check out [this answer](http://stackoverflow.com/a/32240423/2708667) to see if that helps. – santon Oct 01 '15 at 15:17
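
Following santon's suggestion in the comments, pinning the worker interpreter from the driver script would look roughly like this (a sketch only; the Anaconda path is a placeholder for wherever the interpreter actually lives on every node):

import os
# Must be set before the SparkContext is created; placeholder path below.
os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("iris-debug")
sc = SparkContext(conf=conf)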
