
I am receiving an error when reading a file that uses a multi-character delimiter and applying a map function to the SparkContext textFile. Below is the code that throws the error:

import json
import base64
import re
import subprocess
from subprocess import Popen, PIPE
from sys import argv
from pyspark.sql import SparkSession
from pyspark.sql import Row
from py4j.protocol import Py4JJavaError
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark import SparkConf, SparkContext, StorageLevel
import pprint
import sys
import os
import datetime


class sparkimportOperations(object):
    def __init__(self, starttime, source_filename, source_delimiter):
        self.spark = ''
        self.sc = ''
        self.headcal = ''
        self.source_delimiter = source_delimiter
        self.source_filename = source_filename
        self.starttime = starttime

    def initializeSparkSession(self):
        try:
            print('HIFI_SPARK_INFO - initializeSparkSession() - Initializing Spark Session')
            self.spark = SparkSession.builder.appName("ppname").getOrCreate()
            self.sc = self.spark.sparkContext
            print('HIFI_SPARK_DEBUG - Initialized Spark Session')
            self.spark.sparkContext.setLogLevel("WARN")
            return True
        except Py4JJavaError as e:
            print('HIFI_SPARK_ERROR - initializeSparkSession() - Failed to Initialize Spark Session')
            self.status = "FAILED"
            self.error = str(e)
            self.error = self.error + 'HIFI_SPARK_ERROR - initializeSparkSession() - Failed to Initialize Spark Session'
            return False

    def importsrctemptable(self):
        self.headcal = self.spark.read.text(self.source_filename)
        df = self.spark.sparkContext.parallelize(self.headcal.take(1)).map(lambda x: Row(x)).toDF()
        df.write.json("/hdfsData/bdipoc/poc/Inbound/hifitmp/.HIFI/header_datfile" + self.starttime + ".json")
        self.headcal = self.spark.read.json(
            "/hdfsData/bdipoc/poc/Inbound/hifitmp/.HIFI/header_datfile" + self.starttime + ".json").collect()
        self.headers = self.headcal[0][0]['value']
        self.header_column = self.headers.split(self.source_delimiter)
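        # The map() below is the statement that raises the SPARK-5063 error shown further down.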
        self.inmemdf = self.spark.sparkContext.textFile(self.source_filename).map(
            lambda x: x.split(self.source_delimiter)).toDF(self.header_column)
        self.inmemdf.show(100, False)
        return True

    def sparkImportMain(self):
        if self.initializeSparkSession():
            if self.importsrctemptable():
                return True


source_filename = '/hdfsData/bdipoc/poc/Inbound/hifi_unit_test/db2/T5706_CET_ITM_INV_MSV/'
starttime = '10121'
source_delimiter = "|,"
executeimport = sparkimportOperations(starttime, source_filename, source_delimiter)
out = executeimport.sparkImportMain()

Error thrown by the above program:

    rv = reduce(self.proto)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/context.py", line 330, in __getnewargs__
    "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
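
The usual reading of SPARK-5063 in this situation is that the lambda in importsrctemptable() references self.source_delimiter, so the whole object, including self.spark and self.sc (the SparkContext), has to be pickled and shipped to the executors. A minimal sketch of the common workaround, copying the attributes into throwaway local variables (delim and cols are just illustrative names) before calling map():

# Inside importsrctemptable(), after self.header_column has been computed:
delim = self.source_delimiter      # plain string, safe to ship to the executors
cols = self.header_column          # list of column names
self.inmemdf = self.spark.sparkContext.textFile(self.source_filename) \
    .map(lambda x: x.split(delim)) \
    .toDF(cols)

The raw script further down presumably works for the same reason: there source_delimiter is a plain module-level string, not an attribute of an object that also carries the SparkSession and SparkContext.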

I have another raw program with the same logic, written without a class or functions, which works without any issue.

from pyspark.sql import SparkSession
from pyspark.sql import Row
spark=SparkSession.builder.appName("spark engine").getOrCreate()
source_filename='/hdfsData/bdipoc/poc/Inbound/hifi_unit_test/db2/T5706_CET_ITM_INV_MSV/'
starttime = '1012'
source_delimiter = "|,"
headcal = spark.read.text(source_filename)
df = spark.sparkContext.parallelize(headcal.take(1)).map(lambda x: Row(x)).toDF()
df.write.json("/hdfsData/bdipoc/poc/Inbound/hifitmp/.HIFI/header_datfile" + starttime + ".json")
headcal=spark.read.json("/hdfsData/bdipoc/poc/Inbound/hifitmp/.HIFI/header_datfile" + starttime + ".json").collect()
headers = headcal[0][0]['value']
header_column = headers.split(source_delimiter)
inmemdf = spark.sparkContext.textFile(source_filename).map(lambda x: x.split(source_delimiter)).toDF(header_column)
inmemdf.show(10,False)

Both programs do the same thing: they read a "|,"-delimited file. The only difference is that one wraps the logic in a class and calls it through sparkImportMain(), while the other is plain script code.

It appears that you are attempting to reference SparkContext from a broadcast

Please help

Rafa
  • Very simple, follow how the tool is supposed to work. A common mistake when 1st starting out, but entirely logical. Way too complicated, have a look at the guides. – thebluephantom May 01 '21 at 17:42
  • I need to read a file that has a delimiter longer than a single character, with headers. I don't want to hardcode the headers or the delimiter. It works when we do a spark-submit directly, but calling it from a Python script with a class is having issues. Unable to crack it yet. Trying. – Rafa May 02 '21 at 18:12
  • Not sure I really follow. Show some visuals, it helps others. – thebluephantom May 02 '21 at 18:14
  • I have edited the question to include both the working and the non-working code. The logic is the same; it is just written and executed in a different way. – Rafa May 04 '21 at 03:57
  • Pointless to try and do something not supported. – thebluephantom May 04 '21 at 16:29
  • I understand. Looking for ways to rewrite this statement with a regex so that map will not be called: self.inmemdf = self.spark.sparkContext.textFile(self.source_filename).map(lambda x: x.split(self.source_delimiter)).toDF(self.header_column) (see the sketch after these comments) – Rafa May 04 '21 at 21:53
  • Just not how it works. Sorry. – thebluephantom May 04 '21 at 22:35
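
Following up on the last comment: one way to avoid the RDD map() (and hence the closure over self) altogether is to stay in the DataFrame API and use pyspark.sql.functions.split(), which takes a regex pattern. The sketch below only illustrates that idea; it reuses the path and the "|," delimiter from the question and has not been run against the real data.

import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ppname").getOrCreate()
source_filename = '/hdfsData/bdipoc/poc/Inbound/hifi_unit_test/db2/T5706_CET_ITM_INV_MSV/'
source_delimiter = "|,"

# First line holds the header, as in the question; split it on the driver.
raw = spark.read.text(source_filename)
header_column = raw.first()['value'].split(source_delimiter)

# split() takes a regex, so escape the delimiter instead of hardcoding "\|,".
parts = F.split(F.col("value"), re.escape(source_delimiter))
inmemdf = raw.select(*[parts.getItem(i).alias(c) for i, c in enumerate(header_column)])
# Note: like the original code, this does not drop the header row from the data.
inmemdf.show(10, False)

Because split() operates on a DataFrame column, nothing from the driver-side object graph ends up in a worker closure, so the class-versus-script difference should no longer matter.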

0 Answers