
I have a script, wordcount.py.
I used setuptools to create an entry point named wordcount, so I can now call the command from anywhere on the system.
I am trying to run it via spark-submit (command: spark-submit wordcount), but it fails with the following error:

Error: Cannot load main class from JAR file:/usr/local/bin/wordcount
Run with --help for usage help or --verbose for debug output

However, the exact same command works fine when I provide the path to the Python script (command: spark-submit /home/ubuntu/wordcount.py).

Contents of wordcount.py:

import sys
from operator import add

from pyspark.sql import SparkSession

def main(args=None):
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()

if __name__ == "__main__":
    main()

Do you know if there is a way to work around this?
Thanks a lot in advance.

bill

2 Answers

When you run spark-submit wordcount, spark-submit treats wordcount as a JAR file that contains the class to be executed, because the file name does not end in .py.
It also tries to find that JAR under /usr/local/bin, as you have not specified the classpath.
Please provide the contents of the wordcount file. If possible, try giving the full path to wordcount when running spark-submit.

See this link for more information on the spark-submit command: https://spark.apache.org/docs/latest/submitting-applications.html
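
To illustrate why the two commands behave differently, here is a minimal sketch of extension-based dispatch. It is a deliberate simplification and not Spark's actual code: the function name classify_primary_resource is hypothetical, and the sketch ignores R scripts and the interactive shells. It only shows that a file whose name ends in .py is handled as a Python application, while anything else falls through to the JAR path, which then needs a main class.

```python
def classify_primary_resource(path):
    """Simplified, hypothetical model of how the application type is chosen."""
    if path.endswith(".py"):
        return "python"
    # Any other name is assumed to be a JAR, so a main class is required.
    return "jar"


for resource in ("/usr/local/bin/wordcount", "/home/ubuntu/wordcount.py"):
    print(resource, "->", classify_primary_resource(resource))
```

Under this model, the setuptools entry point /usr/local/bin/wordcount has no .py suffix, so it ends up on the JAR path and produces the "Cannot load main class" error.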

vijayinani

I found that if you rename your entry point to have a .py suffix, spark-submit will accept it as a Python application:

entry_points={
    'console_scripts': [
        'wordcount.py = mymodule.wordcount:main',
    ],
}

Then the submission is accepted as expected:

spark-submit ./bin/wordcount.py
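
If you would rather keep the entry point name without the suffix, one possible alternative is to hand spark-submit the installed module's own .py file. The sketch below is only an illustration under assumptions: build_spark_submit_command is a hypothetical helper, and it assumes the package is installed as plain .py source so that the module's origin is a real file path. It relies on wordcount.py calling main() under its __main__ guard, as in the question.

```python
import importlib.util


def build_spark_submit_command(module_name, app_args):
    """Return a spark-submit argument list pointing at the module's .py file."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None or not spec.origin.endswith(".py"):
        raise ValueError(f"cannot find a .py source file for {module_name!r}")
    # spec.origin is the absolute path of the installed source file.
    return ["spark-submit", spec.origin, *app_args]


# Hypothetical usage, with mymodule installed in the current environment:
#   import subprocess
#   cmd = build_spark_submit_command("mymodule.wordcount", ["input.txt"])
#   subprocess.run(cmd, check=True)
```

Because the path passed to spark-submit ends in .py, it is treated as a Python application regardless of what the console script is called.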
Carlos