I am working on a proof of concept to optimize the performance of join queries executed through Drill. The underlying storage is a NoSQL database, MongoDB. The join query takes 46 seconds to return its result. On further analysis, the physical plan of the query shows that both the left-side table (1.5 million records) and the right-side table (1.3 million records) are fully scanned, taking 24 seconds and 20 seconds respectively.
Here is the query:
select ta.[SOME_COLUMN]
from mongo.Test.TABLEA ta
INNER JOIN mongo.Test.TABLEB tb ON ta.Id = tb.Id and ta.Id = '123'
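For context, the physical plan referenced above can be obtained with Drill's standard EXPLAIN statement:

EXPLAIN PLAN FOR
select ta.[SOME_COLUMN]
from mongo.Test.TABLEA ta
INNER JOIN mongo.Test.TABLEB tb ON ta.Id = tb.Id and ta.Id = '123';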
Records in table A: 1.5 million
Records in table B: 1.3 million
Filter condition: Id is an indexed field in both tables (ascending)
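To isolate whether the issue is join planning or filter pushdown in general, a single-table probe such as the following may help (a hypothetical diagnostic, not part of my original analysis); if this also produces a collection scan on the Mongo side, the filter itself is not being pushed down:

select count(*)
from mongo.Test.TABLEB tb
where tb.Id = '123';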
The Drill plan shows a hash join being performed:
- Why does Drill fetch all the records into memory even though a filter condition on the indexed column was provided for one of the tables? At the Mongo level I observed that a collection scan was performed instead of an index scan. What could be the reason for this, given that my join and filter conditions are applied to an indexed column?
- If the Drill planner/optimizer were smart enough, the records could have been filtered on the second table as well, based on the join condition (reducing the dataset and hence the execution time); see the rewritten query below. Is the MongoDB storage plugin not fully optimized, and is that the cause?
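As a possible workaround sketch (assuming the equality filter is always a literal like '123'), the transitive predicate can be written out by hand so that each scan gets its own pushable filter; ta.Id = tb.Id together with ta.Id = '123' implies tb.Id = '123':

select ta.[SOME_COLUMN]
from mongo.Test.TABLEA ta
INNER JOIN mongo.Test.TABLEB tb ON ta.Id = tb.Id
where ta.Id = '123' and tb.Id = '123';

Whether each filter actually reaches Mongo as an index scan would still need to be confirmed from the plan and the Mongo logs.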