I don't know much Spark. At the top of the code I have

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('abc').getOrCreate()
H = spark.read.parquet('path to hdfs file')
H has about 30 million records and will be used in a loop, so I wrote

H.persist().count()   # count() is an action, so it forces the cache to be populated
I have a list of 50 strings, L = [s1, s2, ..., s50], each of which is used to build a small data frame out of H; these small frames are supposed to be stacked on top of each other. I created an empty dataframe Z:
from pyspark.sql.types import StructType

schema = StructType([define the schema here])
Z = spark.createDataFrame([], schema)
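For illustration, a schema of that form might look like the following (these column names are hypothetical; the real schema is different):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# hypothetical two-column schema standing in for the real one
schema = StructType([
    StructField("key", StringType(), True),
    StructField("value", LongType(), True),
])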
Then comes the loop:

for st in L:
    K = process H using st   # pseudocode: K is built from H and has at most 20 rows
    Z = Z.union(K)

When L has only 2 or 3 elements this code works, but with all 50 strings it never finishes. I learned today that I can use checkpoints, so I created a Hadoop path, and right above where the loop starts I wrote:
SparkContext.setCheckpointDir(dirName='path/to/checkpoint/dir')
But I get the following error: missing 1 required positional argument: 'self'. I need to know how to fix the error and how to modify the loop to incorporate the checkpoint.
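For reference, here is a self-contained sketch of the whole pattern. The filter standing in for my processing step and the key/value columns are hypothetical; everything else mirrors the code above:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName('abc').getOrCreate()

# hypothetical schema, as sketched above
schema = StructType([
    StructField("key", StringType(), True),
    StructField("value", LongType(), True),
])

H = spark.read.parquet('path to hdfs file')
H.persist().count()   # materialize the cache before the loop

L = ['s1', 's2']   # stand-in for the real list of 50 strings

Z = spark.createDataFrame([], schema)
for st in L:
    # hypothetical processing step; the real one also returns at most 20 rows
    K = H.filter(f.col('key') == st).select('key', 'value').limit(20)
    Z = Z.union(K)   # Z's lineage grows with every iteration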