
The query plan shows a collection scan, which goes over all the documents in the Mongo collection. I therefore created an index on the WHERE clause columns, expecting Drill to choose an index-based access plan, but Drill continues to use the full table scan. Is there anything else to be done to make Drill use the index?

The actual query, the generated query plan, and the Mongo index are given below.

SQL:

    Select j.user as User, TO_DATE(j.created_at) as submitted_on
    from mongo.example.jobs j
    where j.user = 'john@example.ai' and j.created_at BETWEEN timestamp '2020-03-25 13:12:55' AND timestamp '2020-04-24 13:12:55'

Physical plan (from the Drill UI)

    00-00 Screen : rowType = RecordType(ANY User, ANY submitted_on): rowcount = 121.2375, cumulative cost = {6720.59875 rows, 23532.19875 cpu, 895541.0 io, 0.0 network, 0.0 memory}, id = 10468
    00-01 Project(User=[$0], submitted_on=[TO_DATE($1)]) : rowType = RecordType(ANY User, ANY submitted_on): rowcount = 121.2375, cumulative cost = {6708.475 rows, 23520.075 cpu, 895541.0 io, 0.0 network, 0.0 memory}, id = 10467
    00-02 SelectionVectorRemover : rowType = RecordType(ANY user, ANY created_at): rowcount = 121.2375, cumulative cost = {6587.2375 rows, 22913.8875 cpu, 895541.0 io, 0.0 network, 0.0 memory}, id = 10466
    00-03 Filter(condition=[AND(=($0, 'john@example.ai'), >=($1, 2020-03-25 13:12:55), <=($1, 2020-04-24 13:12:55))]) : rowType = RecordType(ANY user, ANY created_at): rowcount = 121.2375, cumulative cost = {6466.0 rows, 22792.65 cpu, 895541.0 io, 0.0 network, 0.0 memory}, id = 10465
    00-04 Scan(table=[[mongo, example, jobs]], groupscan=[MongoGroupScan [MongoScanSpec=MongoScanSpec [dbName=example, collectionName=jobs, filters=null], columns=[`user`, `created_at`]]]) : rowType = RecordType(ANY user, ANY created_at): rowcount = 3233.0, cumulative cost = {3233.0 rows, 6466.0 cpu, 895541.0 io, 0.0 network, 0.0 memory}, id = 10464

Index created in MongoDB

    {
      "v" : 2,
      "key" : { "user" : 1, "created_at" : 1, "method_map_id" : 1 },
      "name" : "user_1_created_at_1_method_map_id_1",
      "ns" : "example.jobs"
    }
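
For what it's worth, the predicate shape (equality on `user`, range on `created_at`) matches the leading keys of this compound index, so MongoDB should be able to use it whenever the filter actually reaches the server. Below is a minimal sketch that asks the server for an explain of the equivalent find; the connection string and the use of the MongoDB Java driver are assumptions of mine, not part of the original setup:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ExplainDirectFind {
      public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date from = fmt.parse("2020-03-25 13:12:55");
        Date to = fmt.parse("2020-04-24 13:12:55");

        // Same predicate as the Drill WHERE clause, as a Mongo filter document.
        Document filter = new Document("user", "john@example.ai")
            .append("created_at", new Document("$gte", from).append("$lte", to));

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoDatabase db = client.getDatabase("example");

          // Explain the find; the winning plan should show an IXSCAN on
          // user_1_created_at_1_method_map_id_1 rather than a COLLSCAN.
          Document explainCmd = new Document("explain",
              new Document("find", "jobs").append("filter", filter))
              .append("verbosity", "queryPlanner");
          System.out.println(db.runCommand(explainCmd).toJson());
        }
      }
    }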

Moreover, in the Drill documentation, I see that Drill supports indexes only for MapR DB. Does that mean indexes of other data sources like Mongo won't be used?

https://drill.apache.org/docs/querying-indexes-introduction/

bukli
  • With only 121 rows, it might actually be faster to read them all sequentially than to scan an index. – Joe Apr 28 '20 at 08:58
  • @Joe It is actually 121 rows out of a table of 3233 rows. But is there a setting that determines when a full table scan is chosen? – bukli Apr 30 '20 at 06:18
  • You might check with database profiling to see how that query is actually sent to the mongod. – Joe Apr 30 '20 at 07:33
  • @Joe I enabled database profiling and checked the output. I could see 90+ entries for the single query, containing both queries and commands. Most queries had COLLSCAN and a few had an indexed scan. I am not able to get a one-to-one mapping between the Drill and Mongo profiling output. – bukli May 02 '20 at 08:35
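
Following up on the profiling suggestion: one way to see what actually reaches mongod is to read the `system.profile` collection and look at the `planSummary` of each recorded operation (COLLSCAN vs IXSCAN). A minimal sketch, assuming the MongoDB Java driver and a local, unauthenticated mongod (both assumptions on my part):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class ProfilerCheck {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoDatabase db = client.getDatabase("example");

          // Record every operation (profiling level 2); run the Drill query,
          // then inspect what was actually executed against the collection.
          db.runCommand(new Document("profile", 2));

          for (Document op : db.getCollection("system.profile")
              .find(new Document("ns", "example.jobs"))) {
            System.out.println(op.get("op") + " -> " + op.get("planSummary"));
          }
        }
      }
    }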

1 Answer


The problem lies in how the Mongo storage plugin handles timestamp filter predicates. The filter predicates are evaluated by the following modules, in this order:

MongoPushDownFilterForScan -> MongoFilterBuilder -> MongoCompareFunctionProcessor.process() -> MongoCompareFunctionProcessor.visitSchemaPath()

The visitSchemaPath method acts as a getter for the value expression classes, and it has no handler for TimeStampExpression. I added the code below, rebuilt the plugin, and tested it.

    // The processor had no handler for TimeStampExpression; convert the
    // timestamp literal to a java.util.Date for the Mongo filter document.
    if (valueArg instanceof TimeStampExpression) {
      long millis = ((TimeStampExpression) valueArg).getTimeStamp();
      this.value = new Date(millis);
      this.path = path;
      return true;
    }

This allows the timestamp filters to be pushed down into the filter section of the Mongo queries.
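
With the handler in place, the scan should carry a filter instead of the `filters=null` shown in the plan above. As an illustration only (my own sketch using `org.bson.Document`, not the plugin's actual code path), the pushed-down predicate corresponds to a filter document like the one built below, where the timestamp literals have become `java.util.Date` values:

    import org.bson.Document;

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class PushedDownFilterSketch {
      public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date from = fmt.parse("2020-03-25 13:12:55");
        Date to = fmt.parse("2020-04-24 13:12:55");

        // Equality on user plus a range on created_at; the BSON dates let
        // MongoDB use the compound index instead of scanning the collection.
        Document filter = new Document("user", "john@example.ai")
            .append("created_at", new Document("$gte", from).append("$lte", to));

        System.out.println(filter.toJson());
      }
    }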

bukli