
I'm trying to run some simple code in PySpark, but I'm getting a py4j error.

from pyspark import SparkContext

logFile = "file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md"  
sc = SparkContext("local", "word count")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

The error is:

An error occurred while calling o75.printStackTrace. Trace:
py4j.Py4JException: Method printStackTrace([class org.apache.spark.api.java.JavaSparkContext]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:835)

I configured the environment variables, but it still didn't work. I even tried findspark.init(), but that didn't work either. What am I doing wrong?

N. Rad
  • Try using `spark = SparkSession.builder.getOrCreate()` and `sc = spark.sparkContext`? – mck Dec 19 '20 at 17:08
  • I did try that too, but the issue is that it doesn't have the textFile method. I received that error instead of this one. – N. Rad Dec 20 '20 at 20:18
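
For reference, here is what mck's suggestion above looks like spelled out, as a minimal sketch reusing the paths from the question. Note that `textFile` lives on the SparkContext, which you get from the session via `spark.sparkContext`, not on the SparkSession itself:

from pyspark.sql import SparkSession

# Build (or reuse) a session; this creates the underlying SparkContext for you.
spark = SparkSession.builder.master("local").appName("word count").getOrCreate()
sc = spark.sparkContext  # textFile is a SparkContext method, not a SparkSession one

logData = sc.textFile("file:///home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md").cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print(numAs, numBs)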

1 Answer


I am sure the environment variables are not set correctly. Could you please post all of your environment variables? Mine are listed below and work correctly.

Check SCALA_HOME and SPARK_HOME especially. There should not be a "bin" at the end.

My Windows environment variables:

  1. HADOOP_HOME = C:\spark\hadoop
  2. JAVA_HOME = C:\Program Files\Java\jdk1.8.0_151
  3. SCALA_HOME = C:\spark\scala
  4. SPARK_HOME = C:\spark\spark
  5. PYSPARK_PYTHON = C:\Users\user\Anaconda3\envs\python.exe
  6. PYSPARK_DRIVER_PYTHON = C:\Users\user\Anaconda3\envs\Scripts\jupyter.exe
  7. PYSPARK_DRIVER_PYTHON_OPTS = notebook
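
As a quick sanity check, you can print these variables from Python itself before creating the context (a minimal sketch; the list of names is just the set above, so adjust it to whatever you have defined):

import os

# Print the Spark-related variables the Py4J gateway depends on.
for name in ("HADOOP_HOME", "JAVA_HOME", "SCALA_HOME", "SPARK_HOME",
             "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(name, "=", os.environ.get(name, "<not set>"))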
BigData-Guru
  • I'm also using Windows. My environment variables are as follows: HADOOP_HOME = C:\Hadoop\hadoop-2.8.0\bin, JAVA_HOME = C:\Java, SPARK_HOME = C:\spark. I didn't specify the rest. Do I need to? – N. Rad Dec 20 '20 at 20:16
  • Yes, but change the paths to match your installation paths. – BigData-Guru Dec 21 '20 at 03:50
  • For some reason I don't have the spark/scala folder in my Spark installation. I installed Scala sbt. Should I use that path instead of this one? – N. Rad Dec 24 '20 at 19:43
  • Please follow this link: https://www.youtube.com/watch?v=WQErwxRTiW0 – BigData-Guru Dec 24 '20 at 20:00
  • I made some changes and now it is giving me a different error. `An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hadoop/spark-2.1.0-bin-hadoop2.7/README.md` I don't know why it is looking in that directory. I don't have that in my paths! – N. Rad Dec 24 '20 at 20:43
  • It's there in that video. Please follow the same steps; that worked for me. Just keep in mind SCALA_HOME and SPARK_HOME especially: there should not be a "bin" at the end. – BigData-Guru Dec 25 '20 at 07:16
  • Yes, I noticed the issue. On the first line of the code I was referring to a location that didn't exist on my computer. Thanks – N. Rad Dec 25 '20 at 17:49
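
For anyone hitting the same InvalidInputException as in the comments above: the path handed to textFile must exist on the machine running the driver. A minimal sketch of the check (the Windows path below is hypothetical; point it at a file you actually have):

import os
from pyspark import SparkContext

logFile = "C:/spark/spark/README.md"  # hypothetical path; replace with a file that exists locally
assert os.path.exists(logFile), logFile + " does not exist on this machine"

sc = SparkContext("local", "word count")
logData = sc.textFile("file:///" + logFile).cache()
print(logData.count())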