
I don't know much Spark. At the top of my code I have

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.appName('abc').getOrCreate()
H = spark.read.parquet('path to hdfs file')

H has about 30 million records and will be used in a loop. So I wrote

H.persist().count()

I have a list of 50 strings L = [s1, s2, ..., s50], each of which is used to build a small data frame out of H; these small frames are supposed to be stacked on top of each other. I created an empty dataframe Z:

schema = StructType([define the schema here])
Z = spark.createDataFrame([], schema)

Then comes the loop

for st in L:
    K = process H using st
    Z = Z.union(K)

where K has at most 20 rows. When L has only 2 or 3 elements this code works, but when L has all 50 elements it never finishes. I learned today that I can use checkpoints, so I created a Hadoop path, and right above where the loop starts I wrote:

SparkContext.setCheckpointDir(dirName='path/to/checkpoint/dir')

But I get the following error: missing 1 required positional argument: 'self'. I need to know how to fix the error and how to modify the loop to incorporate the checkpoint.

pmjn6

1 Answer


Create an object for the SparkContext; then you need not specify the self parameter. Also, remove the name of the parameter, which is not needed.

A code like below works:

from pyspark import SparkConf
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate(SparkConf())
sc.setCheckpointDir('path/to/checkpoint/dir')
Jim Todd
    Can you tell me how I should change the loop? I want to do the union outside the loop, after all the data sets are checkpointed. If I add `H.checkpoint(eager=True)`, how should I retrieve them? Or maybe I should checkpoint Z inside the loop, as `Z.checkpoint(eager=True)`? – pmjn6 Feb 17 '19 at 09:47