I have a script wordcount.py. I used setuptools to create an entry point named wordcount, so I can now call the command from anywhere on the system.
I am trying to execute it via spark-submit (command: spark-submit wordcount), but it fails with the following error:
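The entry point is declared roughly like this in setup.py (the exact package and module names here are assumptions based on the description above, not the real file):

```python
# Hypothetical setup.py sketch -- module and package names are assumed
# from the description above, not copied from the actual project.
from setuptools import setup

setup(
    name="wordcount",
    version="0.1",
    py_modules=["wordcount"],
    entry_points={
        "console_scripts": [
            # installs /usr/local/bin/wordcount, which calls wordcount.main()
            "wordcount = wordcount:main",
        ],
    },
)
```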
Error: Cannot load main class from JAR file:/usr/local/bin/wordcount
Run with --help for usage help or --verbose for debug output
However, the exact same command works fine when I provide the path to the Python script (command: spark-submit /home/ubuntu/wordcount.py).
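As far as I can tell, spark-submit decides how to run the primary resource from its file extension: .py files go through PySpark, .R files through SparkR, and everything else is assumed to be a JAR. A simplified sketch of that dispatch (not Spark's actual code):

```python
def guess_app_type(path):
    # Simplified sketch of how spark-submit classifies the primary resource.
    # A console-script wrapper like /usr/local/bin/wordcount has no .py
    # extension, so it falls through to the JAR branch -- which matches the
    # "Cannot load main class from JAR" error above.
    if path.endswith(".py"):
        return "python"
    if path.endswith(".R"):
        return "r"
    return "jar"
```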
Content of wordcount.py:
import sys
from operator import add

from pyspark.sql import SparkSession


def main(args=None):
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = SparkSession \
        .builder \
        .appName("PythonWordCount") \
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()


if __name__ == "__main__":
    main()
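For reference, the flatMap/map/reduceByKey pipeline above computes the same per-word counts as this plain-Python equivalent (a standalone sketch, without Spark):

```python
from collections import Counter


def count_words(lines):
    # Pure-Python equivalent of the Spark pipeline: split each line on
    # single spaces (flatMap), then tally occurrences of each token
    # (map to (word, 1) + reduceByKey(add)).
    counts = Counter()
    for line in lines:
        counts.update(line.split(' '))
    return dict(counts)
```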
Is there a way to work around this?
Thanks a lot in advance.