
While developing a Spark Streaming application (Python), I'm not completely sure I fully understand how it works. I just have to read a JSON file stream (files popping up in a directory), perform a join between each JSON object and a reference, and then write the result back to text files. Here is my code:

import configparser

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

config = configparser.ConfigParser()
config.read("config.conf")

def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

# Create the Spark and streaming contexts
sc = SparkContext()
ssc = StreamingContext(sc, int(config["Variables"]["batch_period_spark"]))
sqlCtxt = getSparkSessionInstance(sc.getConf())
df_ref = sqlCtxt.read.json("file://" + config["Paths"]["path_ref"])
df_ref.createOrReplaceTempView("REF")
df_ref.cache()
output = config["Paths"]["path_DATAs_enri"]


# Processing function for the DATA batches
def process(rdd):
    if rdd.count() > 0:
        # print(rdd.toDebugString)
        df_DATAs = sqlCtxt.read.json(rdd)
        df_DATAs.createOrReplaceTempView("DATAs")
        df_enri = sqlCtxt.sql("SELECT DATAs.*, REF.Name, REF.Mail FROM DATAs, REF WHERE DATAs.ID = REF.ID")
        df_enri.createOrReplaceTempView("DATAs_enri")
        df_enri.write.mode('append').json("file://" + output)
        if df_enri.count() < df_DATAs.count():
            df_fail = sqlCtxt.sql("SELECT * FROM DATAs WHERE DATAs.ID NOT IN (SELECT ID FROM DATAs_enri)")
            df_fail.show()


# Set up the stream and start it
files = ssc.textFileStream("file://" + config["Paths"]["path_stream_DATAs"])
files.foreachRDD(process)
print("[GO]")
ssc.start()
ssc.awaitTermination()

Here is my Spark config:

spark.master                    local[*]
spark.executor.memory           3g
spark.driver.memory             3g
spark.python.worker.memory      3g
spark.memory.fraction           0.9
spark.driver.maxResultSize      3g
spark.memory.storageFraction    0.9
spark.eventLog.enabled          true

Well, it works, but I have a question: the processing is slow and the scheduling delay keeps increasing. I am running in local[*], and I am afraid there is no parallelism... In the monitoring UI, I only see one executor and one job at a time. Is there a simpler way to do this, for example with the transform function on the DStream? Is there a configuration variable I am missing?

Flibidi

1 Answer


Well, there are a few reasons why your code is slow.

About the workers: I did not see anywhere that you set the number of workers, so Spark will start with the default, which may be just one. On top of that, you are reading from a single file that may not be very big, so Spark does not parallelize the read.
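
For example, here is a minimal sketch of setting the parallelism explicitly before creating the context. The values below are only illustrative assumptions, not settings from your post; tune them to the number of cores you have:

from pyspark import SparkConf, SparkContext

# Sketch only: explicit parallelism settings instead of relying on defaults
conf = (SparkConf()
        .setMaster("local[*]")
        .set("spark.default.parallelism", "8")       # assumed value
        .set("spark.sql.shuffle.partitions", "8"))   # assumed value; default is 200
sc = SparkContext(conf=conf)
print(sc.defaultParallelism)  # number of partitions Spark will use by default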

On the other hand, you need to understand a few things about your code:

  1. You have a lot of counts: `if rdd.count() > 0:` and `if(df_enri.count() < df_DATAs.count()):`. Counts are expensive; each one is a reduce phase over your streaming data, and you are doing the count three times.
  2. Joins are expensive too; doing a join inside a streaming process is not that good. You did the right thing with `df_ref.cache()`, but a join still shuffles data, and shuffles are expensive (see the sketch after this list for one way to reduce both costs).
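
For example, here is a rough sketch of the same process function with `isEmpty()` instead of the first count and a broadcast hint on the reference table. This is only an illustration of the idea, not your exact code, and it assumes the reference DataFrame df_ref is small enough to broadcast:

from pyspark.sql.functions import broadcast

def process(rdd):
    if rdd.isEmpty():  # cheaper than rdd.count() > 0, avoids a full reduce
        return
    df_DATAs = sqlCtxt.read.json(rdd)
    # broadcast() ships the small REF table to every task instead of
    # shuffling both sides of the join
    df_enri = df_DATAs.join(broadcast(df_ref), on="ID", how="inner")
    df_enri.write.mode('append').json("file://" + output)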

What I suggest to you: don't do that fail step, remove it from your code. If a row didn't join, just don't save the data. Another thing: set more workers or more cores for execution with spark.executor.cores=2, as you can see here.
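
If you really do need the rows that failed to join, a cheaper sketch (again just an illustration, not your exact code) is a left_anti join, which returns the DATA rows whose ID has no match in REF without comparing two counts:

df_fail = df_DATAs.join(df_ref, on="ID", how="left_anti")
df_fail.show()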

Thiago Baldim
  • Well, thanks a lot for this advice! I have another question: the first _count_ in my _if_ is there to prevent Spark from processing RDDs before they arrive. Is it normal that Spark Streaming starts the operations on the stream even if there is nothing to process yet? Because if I don't do it, it tells me that I am processing empty RDDs... – Flibidi May 02 '17 at 14:53
  • For the first `count()`, I suggest you use the function `isEmpty()` https://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD.isEmpty: it is faster for checking whether your RDD is empty, and it will not generate shuffles. – Thiago Baldim May 02 '17 at 14:58