I am writing a Spark app to find the top n accessed URLs within a time frame. The job keeps running and takes hours for 389,451 records in Elasticsearch for a single instance. I want to reduce this time.

I am reading from Elasticsearch in Spark as below:

val df = sparkSession.read
    .format("org.elasticsearch.spark.sql")
    .load(date + "/" + business)
    .withColumn("ts_str", date_format($"ts", "yyyy-MM-dd HH:mm:ss"))
    .drop("ts")
    .withColumnRenamed("ts_str", "ts")
    .select(selects.head, selects.tail: _*)
    .filter($"ts" === ts)
    .withColumn("url", split($"uri", "\\?")(0))
    .drop("uri")
    .withColumnRenamed("url", "uri")
    .cache()

In the above DataFrame I am reading and filtering from Elasticsearch. I am also removing query params from the URI.

Then I am doing a group by:

var finalDF = df
    .groupBy("col1", "col2", "col3", "col4", "col5", "uri")
    .agg(sum("total_bytes").alias("total_bytes"), sum("total_req").alias("total_req"))

Then I am running a window function:

val partitionBy = Seq("col1", "col2", "col3", "col4", "col5")

val window = Window.partitionBy(partitionBy.head, partitionBy.tail: _*).orderBy(desc("total_req"))

finalDF = finalDF.withColumn("rank", rank().over(window)).where($"rank" <= 5).drop("rank")

Then I am writing finalDF to Cassandra:

finalDF.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "table", "keyspace" -> "keyspace"))
    .mode(SaveMode.Append)
    .save()

I have 4 data nodes in the ES cluster, and my Spark machine is a VM with 16 cores and 64 GB RAM. Please help me find where the problem is.

hard coder
Could you post a screenshot of the SQL query plan? Also, are you saying you have a single VM running your Spark app? How are you starting this SparkContext (e.g. using local mode, standalone cluster, how much driver memory, executor memory, cores per executor, etc.)? – Silvio Jan 05 '18 at 15:10
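
For reference, one way to get a textual version of the plan the comment asks about (assuming the finalDF from the question; explain(true) prints the parsed, analyzed, optimized, and physical plans):

    // Print the full SQL query plan before triggering the Cassandra write.
    finalDF.explain(true)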

1 Answer


It could be a good idea to persist your DataFrame after reading it, because you are going to use it multiple times in the rank function.
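
A minimal sketch of that suggestion, assuming the finalDF variable from the question (the explicit StorageLevel is an assumption; the parameterless persist() would also work):

    import org.apache.spark.storage.StorageLevel

    // Keep the aggregated result around so the window/rank step and the
    // Cassandra write reuse it instead of recomputing the Elasticsearch
    // read and the groupBy.
    finalDF = finalDF.persist(StorageLevel.MEMORY_AND_DISK)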

eruiz