I want to understand what happens under the hood when I run the following script, t1.py, with python3 t1.py. Specifically, I have these questions:
- What kind of code is submitted to the Spark worker nodes? Is the Python code itself shipped to the workers, or is it translated into equivalent Java/JVM code first?
- Is the add operation in the reduce treated as a UDF and therefore run in a Python subprocess on the worker node?
- If the add operation does run in a Python subprocess on the worker node, does the worker JVM communicate with that subprocess once for every number being added in a partition? If so, that would be a lot of round-trip overhead. (The two comparison sketches after the script are how I would try to measure this.)
#!/home/python3/venv/bin/python3
# this file is named t1.py
from datetime import datetime
from operator import add

from pyspark.sql import SparkSession

n = int(100000000 / 1)  # adjust the divisor to scale the test; n avoids shadowing the builtin len
print("n=", n)

spark = SparkSession.builder.appName('ai_project').getOrCreate()

start = datetime.now()
t = spark.sparkContext.parallelize(range(n))  # distribute the range across the executors
a = t.reduce(add)  # sum all elements with a plain Python function
print(a)
end = datetime.now()
print("end for spark rdd sum:", end, end-start)