I created the same Spark DataFrame in two ways in order to run Spark SQL on it.
1. I read the data from a .csv file straight into a DataFrame in the Spark shell using the following command:
val df = spark.read.option("header", true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
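Note that I did not set inferSchema here, so as far as I understand all columns come in as strings; the typed variant would use the standard CSV option:

val df = spark.read.option("header", true).option("inferSchema", true).csv("C:\\Users\\Tony\\Desktop\\test.csv")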
2. I created a collection in MongoDB from the same .csv file and then, using the MongoDB Spark Connector, I imported it as an RDD into Spark, which I then turned into a DataFrame using the following commands (in cmd / the Spark shell):
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)
val df = rdd.toDF()
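For completeness: if I am reading the 3.0.1 connector docs correctly, the RDD step shouldn't even be necessary, since the connector can load a DataFrame directly (untested sketch, same spark-shell session):

import com.mongodb.spark._
val df = MongoSpark.load(spark) // the SparkSession overload returns a DataFrame
// or via the DataFrameReader, using the connector's "mongo" short name:
val df2 = spark.read.format("mongo").load()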
After that, in either case, I created a temporary view of the DataFrame using the following command:
df.createOrReplaceTempView("sales")
Then I ran the same queries on each DataFrame, and the execution times were very different. With the following example query, the DataFrame created the 1st way ran 4-5 times faster than the one created the 2nd way.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
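To help diagnose the difference, the physical plans of the two versions can be compared with Spark's standard explain:

spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").explain(true) // extended = true also prints the parsed/analyzed/optimized plans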
The collection has 1 million documents, each with the following structure:
id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13
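Since the connector infers column types by sampling the collection, while my CSV read (without inferSchema) produces strings, the two DataFrames may not even have the same schema; that can be checked on each version with:

df.printSchema() // compare e.g. whether `Unit Price` is string or double in each version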
In my case I have to load the DataFrame from MongoDB through the connector, so why is the second version so much slower?