
I am working on a proof of concept to optimize the performance of join queries executed through Drill. The underlying storage is a NoSQL database, MongoDB. The join query takes 46 seconds to return its result. Upon further analysis of the query's physical plan, I observed that both the left-side table (1.5 million records) and the right-side table (1.3 million records) are fully scanned, taking 24 seconds and 20 seconds respectively.

Here is the query:

select ta.[SOME_COLUMN]  
from mongo.Test.TABLEA ta  
INNER JOIN mongo.Test.TABLEB tb ON ta.Id = tb.Id and ta.Id = '123'
  • Records in table A: 1.5 million

  • Records in table B: 1.3 million

Filter condition: Id is an indexed field in both tables (ascending).
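One way to see what the planner actually generates is Drill's `EXPLAIN` statement. A sketch based on the query above (using the `Id` column for illustration in place of the elided column):

```sql
-- Show the physical plan Drill will execute. In the JSON plan, look for a
-- "filters" entry inside the mongo scan spec to confirm whether the
-- predicate on Id was pushed down to MongoDB or is applied by Drill itself.
EXPLAIN PLAN FOR
SELECT ta.Id
FROM mongo.Test.TABLEA ta
INNER JOIN mongo.Test.TABLEB tb
        ON ta.Id = tb.Id AND ta.Id = '123';
```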

Drill Plan shows Hash-join being performed:


  1. Why does Drill fetch all the records into memory even though a filter condition on the indexed column was provided for one of the tables? At the MongoDB level I observed that a collection scan was performed instead of an index scan. What could be the reason for this, given that my join and filter conditions are applied on an indexed column?
  2. If the Drill planner/optimizer were intelligent enough, records could have been filtered on the second table as well, based on the join condition (reducing the dataset and hence the execution time). Is the MongoDB storage plugin not fully optimized, and is that what causes this?
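Regarding the second question, a workaround until the planner infers the transitive predicate itself is to state the filter on both sides explicitly: since `ta.Id = tb.Id` and `ta.Id = '123'` together imply `tb.Id = '123'`, the rewrite below is equivalent and gives the second scan a chance to benefit from the filter as well (a sketch based on the query above, again using `Id` for illustration):

```sql
SELECT ta.Id
FROM mongo.Test.TABLEA ta
INNER JOIN mongo.Test.TABLEB tb
        ON ta.Id = tb.Id
WHERE ta.Id = '123'
  AND tb.Id = '123';  -- implied by the join condition, stated explicitly
```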
Mridul Chopra
  • Or maybe it is because MongoDB has no joins? Even its $lookup aggregation stage still does exactly this – Sammaye Nov 04 '16 at 08:50
    True that MongoDB does not have joins. But Drill provides that functionality by breaking the join into scans of the constituent tables and performing a hash join. However, it is not doing the scans efficiently: all filtering, projection, sorting, etc. should be pushed down to MongoDB, which can do them more efficiently using its indexes. – Nikhilesh Gargi Nov 07 '16 at 08:01

1 Answer


Drill's MongoDB storage plugin doesn't support pushing down joins; it only supports pushing down filters.

Storage plugins are the components responsible for communicating with the data sources that Drill supports.

If you wish, you'll have to provide your own optimization rules to push down joins. This requires Apache Calcite experience and writing a new storage plugin that contributes those rules.

I don't know of a way to contribute new rules other than through a custom plugin that does so. The plugin would not need to deal with MongoDB itself; it would only have to provide the necessary rules.

Muhammad Gelbana