
I've installed PySpark on my computer and run it from the Anaconda prompt. When I launch pyspark in the prompt, I get an error when calling the show() function.

Here are the steps I run after launching pyspark (a self-contained version follows just below):

    import sys
    sys.path.append(r"C:\Spark\spark-3.3.0-bin-hadoop3\jars")
    from graphframes import GraphFrame  # also tried: from graphframes import *
    pp = spark.createDataFrame([("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)], ["id", "name", "age"])
    e = spark.createDataFrame([("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")], ["src", "dst", "relationship"])
    g = GraphFrame(pp, e)
    m = g.inDegrees
    m.show()
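
For reference, here are the same steps as a self-contained script (assuming graphframes is importable the same way, and building `spark` explicitly instead of taking it from the shell):

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    # Outside the pyspark shell there is no predefined `spark`, so build one.
    spark = SparkSession.builder.appName("graphframes-test").getOrCreate()

    vertices = spark.createDataFrame(
        [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)],
        ["id", "name", "age"],
    )
    edges = spark.createDataFrame(
        [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")],
        ["src", "dst", "relationship"],
    )

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()  # this is the call that fails with the traceback below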

Then I get this message:

"Python can't be found.Execute without arguments to proceed
[Stage 1:>                  (0 + 5) / 8][Stage 2:>                  (0 + 3) / 8]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Spark\spark-3.3.0-bin-hadoop3\python\pyspark\sql\dataframe.py", line 606, in show
    print(self._jdf.showString(n, 20, vertical))
  File "C:\Spark\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\java_gateway.py", line 1321, in __call__
  File "C:\Spark\spark-3.3.0-bin-hadoop3\python\pyspark\sql\utils.py", line 190, in deco
    return f(*a, **kw)
  File "C:\Spark\spark-3.3.0-bin-hadoop3\python\lib\py4j-0.10.9.5-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o78.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 1 times, most recent failure: Lost task 4.0 in stage 1.0 (TID 12) (LAPTOP-UUCFNR39 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:694)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:738)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:690)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:655)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:631)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:588)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:546)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 31 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:694)
        at java.base/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:738)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:690)
        at java.base/java.net.ServerSocket.platformImplAccept(ServerSocket.java:655)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:631)
        at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:588)
        at java.base/java.net.ServerSocket.accept(ServerSocket.java:546)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
        ... 31 more
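
The first line of the output, "Python can't be found", suggests that the Python worker Spark tries to spawn cannot resolve a real `python` executable (on Windows, the bare `python` command can point at the Microsoft Store stub). A minimal sketch of pinning the interpreter, assuming the Anaconda python that launched the shell is the right one; this has to run before the SparkSession is created, e.g. at the top of a standalone script:

    import os
    import sys

    # Point the driver and the Python workers at one known interpreter.
    # sys.executable is the Anaconda python here; adjust the path if needed.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable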

I have added graphframes-0.8.2-spark3.2-s_2.12.jar to the jars folder of my Spark home because graphframes was not recognised otherwise.
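
As an alternative to copying the jar by hand, Spark can resolve the package itself through the `spark.jars.packages` config; a sketch, assuming the Maven coordinates match the jar name:

    from pyspark.sql import SparkSession

    # Coordinates assumed from graphframes-0.8.2-spark3.2-s_2.12.jar; the
    # spark-packages repository may also need to be added via
    # spark.jars.repositories.
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
        .getOrCreate()
    )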

EDIT: I've just tried the same steps in VS Code (in Jupyter mode, I don't know if that changes anything) and it works!
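
To see what actually differs between the two environments, this quick check can be run in both the Anaconda pyspark shell and the VS Code Jupyter kernel:

    import os
    import sys

    # Which interpreter is driving, and is a worker interpreter pinned?
    print(sys.executable)
    print(os.environ.get("PYSPARK_PYTHON"))
    print(os.environ.get("PYSPARK_DRIVER_PYTHON"))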
