I created the same Spark DataFrame in two ways in order to run Spark SQL on it.
1. I read the data from a .csv file straight into a DataFrame in the Spark shell using the following command:
val df = spark.read.option("header", true).csv("C:\\Users\\Tony\\Desktop\\test.csv")
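Note that I did not set inferSchema here, so as far as I understand all columns come in as strings; the typed variant would use the standard CSV option:

val df = spark.read.option("header", true).option("inferSchema", true).csv("C:\\Users\\Tony\\Desktop\\test.csv")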
2. I created a collection in MongoDB from the same .csv file and then, using the MongoDB Spark Connector, I imported it as an RDD into Spark, which I then turned into a DataFrame using the following commands (in cmd / the Spark shell):
spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/myDb.myBigCollection" --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
import com.mongodb.spark._
val rdd = MongoSpark.load(sc)
val df = rdd.toDF()
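For completeness: if I am reading the 3.0.1 connector docs correctly, the RDD step shouldn't even be necessary, since the connector can load a DataFrame directly (untested sketch, same spark-shell session):

import com.mongodb.spark._
val df = MongoSpark.load(spark) // the SparkSession overload returns a DataFrame
// or via the DataFrameReader, using the connector's "mongo" short name:
val df2 = spark.read.format("mongo").load()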
After that, in either case, I created a temporary view of the DataFrame using the following command:
df.createOrReplaceTempView("sales")
Then I ran the same queries on each DataFrame, and the execution times were very different. With the following example query, the DataFrame created the 1st way ran 4-5 times faster than the one created the 2nd way.
spark.time(spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").show())
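To help diagnose the difference, the physical plans of the two versions can be compared with Spark's standard explain:

spark.sql("SELECT Region, Country, `Unit Price`, `Unit Cost` FROM sales WHERE `Unit Price` > 500 AND `Unit Cost` < 510 ORDER BY Region").explain(true) // extended = true also prints the parsed/analyzed/optimized plans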
The collection has 1 million documents, each with the following structure:
id: 61a6540c3838fe02b81e5338
Region: "Sub-Saharan Africa"
Country: "South Africa"
Item Type: "Fruits"
Sales Channel: "Offline"
Order Priority: "M"
Order Date: 2012-07-26T21:00:00.000+00:00
Order ID: 443368995
Ship Date: 2012-07-27T21:00:00.000+00:00
Units Sold: 1593
Unit Price: 9.33
Unit Cost: 6.92
Total Revenue: 14862.69
Total Cost: 11023.56
Total Profit: 3839.13
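Since the connector infers column types by sampling the collection, while my CSV read (without inferSchema) produces strings, the two DataFrames may not even have the same schema; that can be checked on each version with:

df.printSchema() // compare e.g. whether `Unit Price` is string or double in each version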
In my case I have to load the DataFrame from MongoDB through the connector, so why is the second version so much slower?