I've run into weird behaviour with a MongoDB distinct query. Currently I'm using version 2.6.10. OK, let's create a simple collection for testing, with an explanation.
from pymongo import MongoClient
import random

client = MongoClient('127.0.0.1', 27017)
client.DBTEST.random.remove({})  # start from an empty collection

value = 0
BATCH_LEN = 16384
BATCH = []
for i in xrange(0, 500000):
    BATCH.append({
        "product": "value_uniq_1",
        "number": value
    })
    # roughly a 2% chance to switch to a new "number", so each value repeats many times
    if random.randint(0, 100) <= 1:
        value = i
    if len(BATCH) > BATCH_LEN:
        client.DBTEST.random.insert(BATCH)
        BATCH = []
client.DBTEST.random.insert(BATCH)
BATCH = []
OK, this will create a collection which contains documents like this:
╔══════════════╦════════╗
║ product ║ number ║
╠══════════════╬════════╣
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 56 ║
║ value_uniq_1 ║ 56 ║
║ value_uniq_1 ║ 56 ║
║ ... ║ ... ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
╚══════════════╩════════╝
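As a quick sanity check (a rough sketch; exact counts will vary because the generator is randomized), the shape of the data can be verified from pymongo:
# Rough sanity check of the generated collection (numbers vary from run to run).
print client.DBTEST.random.count()                  # ~500000 documents
print len(client.DBTEST.random.distinct("number"))  # ~10000 distinct numbers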
Now I have only one unique value for product, but in the near future (about a week) it will grow to roughly 30 different string values, like this:
╔══════════════╦════════╗
║ product ║ number ║
╠══════════════╬════════╣
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 1 ║
║ value_uniq_1 ║ 56 ║
║ value_uniq_1 ║ 56 ║
║ value_uniq_1 ║ 56 ║
║ ... ║ ... ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_1 ║ 150054 ║
║ value_uniq_2 ║ 987 ║
║ value_uniq_2 ║ 987 ║
║ value_uniq_2 ║ 987 ║
╚══════════════╩════════╝
OK, I'm done with my data structure; now let's look at the MongoDB queries a bit.
My main goal is to get all unique values of number for a certain product.
I do it this way:
db.random.distinct("number", {product: "value_uniq_1"})
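For reference, the same call through pymongo would look roughly like this (a sketch; it uses a filtered cursor and reuses the client from the generator script above):
# Same distinct with a filter, issued via a filtered cursor in pymongo (a sketch).
numbers = client.DBTEST.random.find({"product": "value_uniq_1"}).distinct("number")
print len(numbers)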
OK, this isn't very verbose for debugging, so I'll use db.runCommand
in the following examples. But first, let's run distinct without a query and look at the stats
section:
db.runCommand({distinct: 'random', key:'number'})
"stats" : {
"n" : 500000,
"nscanned" : 500000,
"nscannedObjects" : 500000,
"timems" : 479,
"cursor" : "BasicCursor"
},
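(For completeness, the same command can be issued from pymongo as well; a sketch, assuming the command reply carries the same stats subdocument shown above:)
# The distinct command via pymongo's db.command (a sketch).
res = client.DBTEST.command("distinct", "random", key="number")
print res["stats"]["cursor"], res["stats"]["nscannedObjects"]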
That's OK, because we haven't created any indexes yet; let's add one for the number
field:
db.random.createIndex({number: 1})
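(The pymongo equivalent would be something like this, as a sketch:)
# pymongo equivalent of the shell createIndex above (a sketch).
client.DBTEST.random.create_index([("number", 1)])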
Re-run the previous query:
db.runCommand({distinct: 'random', key:'number'})
"stats" : {
"n" : 10005,
"nscanned" : 10005,
"nscannedObjects" : 0,
"timems" : 83,
"cursor" : "DistinctCursor"
},
Great, it uses the index, and everything works fine! 0 nscannedObjects!!!
Okay, let's add a query to the distinct:
db.runCommand({distinct: 'random', key:'number', query: {product: "value_uniq_1"}})
"stats" : {
"n" : 500000,
"nscanned" : 500000,
"nscannedObjects" : 500000,
"timems" : 694,
"cursor" : "BasicCursor"
},
That's not what we expected ("nscannedObjects" : 500000), BUT there is no index for product yet, so let's create one:
db.random.createIndex({product: 1, number: -1})
The direction makes no difference: any combination of product: 1, number: -1 OR product: -1, number: 1 OR product: 1, number: 1 gives the same behaviour; I've checked all combinations.
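For reference, here is roughly how that check can be reproduced from pymongo (a sketch; it rebuilds the compound index for each direction pair and re-runs the queried distinct, so it is slow on a big collection):
# Try each direction combination of the compound index and print the
# cursor type and scan count reported by the distinct command (a sketch).
for d1, d2 in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    name = client.DBTEST.random.create_index([("product", d1), ("number", d2)])
    res = client.DBTEST.command("distinct", "random", key="number",
                                query={"product": "value_uniq_1"})
    print name, res["stats"]["cursor"], res["stats"]["nscannedObjects"]
    client.DBTEST.random.drop_index(name)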
db.runCommand({distinct: 'random', key:'number', query: {product: "value_uniq_1"}})
"stats" : {
"n" : 500000,
"nscanned" : 500000,
"nscannedObjects" : 500000,
"timems" : 968,
"cursor" : "BtreeCursor product_1_number_-1"
},
WTF is going on? Why is it scanning the whole collection even though there is an index? Currently the whole collection contains only ONE product value, and I can't guess what will happen with different products. Why is such a common distinct query so slow? 1 second is too slow...
I don't want to use a separate collection for each product;
that's crazy and inefficient, because I need queries shared across all products. My real DB contains more than 5 million numbers per product, and this query takes more than 3 seconds there.