
This is my current folder structure:
|- logger
|--- __init__.py
|--- logger.py
|- another_package
|--- __init__.py
|--- module1.py
|- models
|--- model1
|------ main.py
|------ model1_utilities.py
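
Both main.py and model1_utilities.py log through the custom logger package. As a minimal sketch of what logger/__init__.py does (hypothetical; the real package may differ, but the log output below shows a logger named "main" with the stdlib default format):

# logger/__init__.py (hypothetical sketch of the custom package)
import logging

logging.basicConfig(level=logging.INFO)  # default format renders as "INFO:<name>:<message>"
_log = logging.getLogger("main")  # the log output below shows the logger is named "main"

def info(msg):
    _log.info(msg)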

The Spark context and session are started in main.py. main.py calls a function from model1_utilities.py like this:

results = function_a(params)
logger.info('Calculation completed')
num_rows = results.shape[0]

There is also a log statement in model1_utilities.py, logger.info("Completed function call"), placed just before the return statement.
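
So the top of model1_utilities.py is roughly this (a sketch; the pyspark.pandas import and the body of function_a are placeholders for the real code, arranged so that import logger falls on line 3, matching the traceback below):

# model1_utilities.py (sketch; the real file does more)
import pyspark.pandas as ps  # placeholder for whatever the real file imports
import logger  # line 3 in the traceback; this is the import that fails

def function_a(params):
    results = ...  # the real Spark calculation goes here
    logger.info("Completed function call")  # logged just before returning
    return results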

This is the error I get:

INFO:main:Completed function call                               
INFO:main:Calculation completed
22/06/16 15:04:36 ERROR Executor: Exception in task 0.0 in stage 451.0 (TID 908)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/model1_utilities.py", line 3, in <module>
    import logger
ModuleNotFoundError: No module named 'logger'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
        at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
        at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:50)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:508)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage87.agg_doAggregateWithoutKey_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage87.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
22/06/16 15:04:36 WARN TaskSetManager: Lost task 0.0 in stage 451.0 (TID 908) (10.0.0.162 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/model1_utilities.py", line 3, in <module>
    import logger
ModuleNotFoundError: No module named 'logger'

        ... (same Java stack trace as in the ERROR above) ...

22/06/16 15:04:36 ERROR TaskSetManager: Task 0 in stage 451.0 failed 1 times; aborting job
ERROR:main:Program did not finish successfully
Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/main.py, line 174, in <module>
    num_alerts = results.shape[0]
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 7445, in shape
    return len(self), len(self.columns)
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 11909, in __len__
    return self._internal.resolved_copy.spark_frame.count()
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 680, in count
    return int(self._jdf.count())
  File "/Users/user1/Desktop/work/tools/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/model1_utilities.py", line 3, in <module>
    import logger
ModuleNotFoundError: No module named 'logger'

logger is imported in the model1_utilities.py file. To make my packages importable, I added their path to sys.path in main.py; I haven't done the same in model1_utilities.py (not sure if that's necessary). But I don't understand why this would be the issue. I searched for known problems with using modules and packages in PySpark jobs but couldn't find anything. Any leads would help, thanks!
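
For reference, the path setup in main.py looks roughly like this (a sketch; I'm assuming a sys.path insertion of the project root, which on my machine is /Users/user1/Desktop/code, and the exact mechanism in the real code may differ):

# main.py (sketch of the path setup; exact mechanism may differ)
import os
import sys

# Make the project root (the directory containing logger/ and another_package/)
# importable. main.py sits in <root>/models/model1/, so go two levels up.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))

import logger  # works on the driver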

1 Answer


It turns out this was not a PySpark bug. The error appeared because I was not adding the package path in model1_utilities.py; adding it there fixed the error. What I hadn't realized is that Spark runs Python workers in separate processes that import model1_utilities.py fresh (which is why the traceback says "in <module>" at the import logger line), so the sys.path change made in main.py is never visible to them. Any file that gets imported on the workers needs the path set up itself, or the packages have to be shipped to the workers another way (for example with spark-submit --py-files or SparkContext.addPyFile).
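
For anyone hitting the same thing, the fix is to do the same sys.path setup at the top of model1_utilities.py, before import logger, so the path is also set when Spark's Python workers import the file. Roughly (a sketch; adjust the relative path to wherever your packages live):

# model1_utilities.py, after the fix (sketch)
import os
import sys

# This runs whenever the module is imported, including inside Spark's Python
# worker processes, so "import logger" below succeeds on the executors too.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))

import logger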