My folder structure currently looks like this:
|- logger
|--- __init__.py
|--- logger.py
|- another_package
|--- __init__.py
|--- module1.py
|- models
|--- model1
|------ main.py
|------ model1_utilities.py
The Spark context and session are started in main.py, and main.py calls a function from model1_utilities.py like this:
results = function_a(params)
logger.info('Calculation completed')
num_alerts = results.shape[0]
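results is a pyspark.pandas DataFrame, so results.shape[0] triggers a count on the cluster; I believe that's why both log lines below appear before the error, since the computation itself is lazy. For completeness, the session setup at the top of main.py looks roughly like this (simplified; the app name is just a placeholder):

# top of main.py (simplified sketch; app name is a placeholder)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('model1').getOrCreate()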
I also have a log statement in model1_utilities.py, just before the return statement:
logger.info("Completed function call")
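For reference, model1_utilities.py is shaped roughly like this (heavily simplified; the body of function_a is a placeholder):

# model1_utilities.py (simplified sketch)
import logger  # line 3 in the real file -- the import the executors fail on

def function_a(params):
    results = ...  # real calculation elided; returns a pyspark.pandas DataFrame
    logger.info("Completed function call")
    return results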
This is the error I get:
INFO:main:Completed function call
INFO:main:Calculation completed
22/06/16 15:04:36 ERROR Executor: Exception in task 0.0 in stage 451.0 (TID 908)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/model1_utilities.py", line 3, in <module>
    import logger
ModuleNotFoundError: No module named 'logger'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:101)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:50)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:508)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage87.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage87.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
22/06/16 15:04:36 WARN TaskSetManager: Lost task 0.0 in stage 451.0 (TID 908) (10.0.0.162 executor driver): org.apache.spark.api.python.PythonException: [same Python traceback and stack trace as above]
22/06/16 15:04:36 ERROR TaskSetManager: Task 0 in stage 451.0 failed 1 times; aborting job
ERROR:main:Program did not finish successfully
Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/main.py", line 174, in <module>
    num_alerts = results.shape[0]
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 7445, in shape
    return len(self), len(self.columns)
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/pandas/frame.py", line 11909, in __len__
    return self._internal.resolved_copy.spark_frame.count()
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 680, in count
    return int(self._jdf.count())
  File "/Users/user1/Desktop/work/tools/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/Users/user1/opt/anaconda3/envs/work/lib/python3.9/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/Users/user1/Desktop/code/models/model1/model1_utilities.py", line 3, in <module>
    import logger
ModuleNotFoundError: No module named 'logger'
The logger package is imported in model1_utilities.py. To make the packages importable, I added their path in main.py; I haven't done the same in model1_utilities.py (I'm not sure whether that's necessary). I don't understand why the import fails here. I searched for known problems with using custom modules and packages in PySpark jobs but couldn't find anything. Any leads would help. Thanks!
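The path setup in main.py is just a sys.path append before the imports, something like this (the exact path is illustrative):

# top of main.py (path is illustrative)
import sys
sys.path.append('/Users/user1/Desktop/code')  # repo root containing logger/ and another_package/

import logger  # resolves fine on the driver

Do the executor Python workers simply not inherit the driver's sys.path? And if so, is something like SparkContext.addPyFile the intended way to make the logger package available to them?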