1

We are new to Mongodb, and would like to use it to insert genomic data (165M entries), and retrieve this data by genomic coordinates (ranges). Below is the type of data we store in a single table. Where column names are chrom,chromStart,chromEnd,datasetid,target,biotype

chr1 9903 10282 ENCSR440COG ZNF239 HEK293
chr1 9904 10252 ENCSR721QZV ZSCAN18 HEK293
chr1 9905 10132 ENCSR241LIH AFF1 K-562
chr1 9905 10238 ENCSR211GNP ZSCAN4 HEK293
chr1 9905 10241 ENCSR776LDJ ZNF645 HEK293
chr1 9905 10243 ENCSR042TWZ SNIP1 MCF-7
chr2 938173 938703 ENCSR000BUL MAX MCF-7
chr2 938174 938376 ENCSR108TYQ GATAD1 Hep-G2
chr3 938174 938412 ENCSR887MXT ZHX1 HeLa-S3
chr3 945236 945377 GSE46055 KDM5B SUM185_SHCTCF
chr4 945236 945488 ENCSR000BPU ETS1 A-549
chr4 945240 945501 GSE76494 CTCF HEK293
chr4 950008 951114 GSE67783 STAG1 HSPC
chr4 950013 950185 ENCSR000BQT TCF3 GM12878
chr4 950015 950797 ENCSR115BLD KDM1A Hep-G2
chr4 950024 950693 GSE88734 ZEB1 MIA-PaCa-2
chr4 950028 950565 ENCSR753GIA TARDBP HEK293T

The type of genomic ranges queries would be :

db.hsap_all_peaks.find({ chrom: "chr1",  chromStart: {$gte: 9905}, chromEnd:{$lte: 10243}} ).count()
db.hsap_all_peaks.find({ chrom: "chr4",  chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}} ).pretty()

In the long run, we plan to queries on ranges but also on values like:

db.hsap_all_peaks.find({ chrom: "chr4",  chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}} , target: "KDM1A").pretty()

This is how we created indexes for the coordinates :

db.hsap_all_peaks.createIndex(
    {chrom:1}
)
db.hsap_all_peaks.createIndex(
    {chrom:1,chromStart:1,chromEnd:1}
)

However, the queries are very long to execute, and it seems that the indexes for chromStart, chromEnd are not working.

Hence my question: what would be the best way to create indexes here?

Extra information :

> db.hsap_all_peaks.getIndexes()
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "chrom" : 1,
            "chromStart" : 1,
            "chromEnd" : 1
        },
        "name" : "chrom_1_chromStart_1_chromEnd_1"
    },
    {
        "v" : 2,
        "key" : {
            "chrom" : 1
        },
        "name" : "chrom_1"
    }
]

If you want to recreate a similar'ish table :

wget http://remap.univ-amu.fr/storage/remap2020/hg38/MACS2/remap2020_all_macs2_hg38_v1_0.bed.gz
gunzip remap2020_all_macs2_hg38_v1_0.bed.gz
mongoimport -d databaseName -c hsap_all_peaks --type tsv --file remap2020_all_macs2_hg38_v1_0.bed -f chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd,itemRgb  --numInsertionWorkers 2

Explain() output for a classic query :

db.hsap_all_peaks.find({ chrom: "chr2",  chromStart: {$gte: 50967094}, chromEnd:{$lte: 50970983} }  ).explain()
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "remap2020.hsap_all_peaks",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "chrom" : {
                        "$eq" : "chr2"
                    }
                },
                {
                    "chromEnd" : {
                        "$lte" : 50970983
                    }
                },
                {
                    "chromStart" : {
                        "$gte" : 50967094
                    }
                }
            ]
        },
        "queryHash" : "2A452369",
        "planCacheKey" : "C93EF492",
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "chrom" : 1,
                    "chromStart" : 1,
                    "chromEnd" : 1
                },
                "indexName" : "chrom_1_chromStart_1_chromEnd_1",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "chrom" : [ ],
                    "chromStart" : [ ],
                    "chromEnd" : [ ]
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "chrom" : [
                        "[\"chr2\", \"chr2\"]"
                    ],
                    "chromStart" : [
                        "[50967094.0, inf.0]"
                    ],
                    "chromEnd" : [
                        "[-inf.0, 50970983.0]"
                    ]
                }
            }
        },
        "rejectedPlans" : [
            {
                "stage" : "FETCH",
                "filter" : {
                    "$and" : [
                        {
                            "chromEnd" : {
                                "$lte" : 50970983
                            }
                        },
                        {
                            "chromStart" : {
                                "$gte" : 50967094
                            }
                        }
                    ]
                },
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "chrom" : 1
                    },
                    "indexName" : "chrom_1",
                    "isMultiKey" : false,
                    "multiKeyPaths" : {
                        "chrom" : [ ]
                    },
                    "isUnique" : false,
                    "isSparse" : false,
                    "isPartial" : false,
                    "indexVersion" : 2,
                    "direction" : "forward",
                    "indexBounds" : {
                        "chrom" : [
                            "[\"chr2\", \"chr2\"]"
                        ]
                    }
                }
            }
        ]
    },
    "serverInfo" : {
        "host" : "sormiou.local",
        "port" : 27017,
        "version" : "4.4.3",
        "gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
    },
    "ok" : 1
}
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Benoit B.
  • 11,854
  • 8
  • 26
  • 29
  • 1
    I dont know the answer, but a quick google indicates MongoDb supports query plans and ways to dig into them using https://docs.mongodb.com/manual/reference/method/db.collection.explain/#mongodb-method-db.collection.explain – Matt Evans Apr 29 '21 at 08:31
  • General guidelines about [MongoDB Indexing Strategies](https://docs.mongodb.com/manual/applications/indexes/). – prasad_ Apr 29 '21 at 08:49
  • Thanks @prasad_ this is the first thing I did, googled it, and looked at the guidelines, however didn' t really found an answer (hence my post here) – Benoit B. Apr 29 '21 at 08:58
  • Having the two indexes is not required - "chrom_1" and "chrom_1_chromStart_1_chromEnd_1" . For a compound index, the _index prefix_ serves the purpose of the "chrom_1" (in your case). – prasad_ Apr 29 '21 at 09:21
  • 1
    Your queries (as listed in the post) cannot use all indexes. As I see you only require one compound index. (Unless you are querying on individual fields). Indexes use disk space and RAM. – prasad_ Apr 29 '21 at 09:34
  • How " it seems that the indexes for chromStart, chromEnd are not working." ? What `db.hsap_all_peaks.find({ chrom: "chr1", chromStart: {$gte: 9905}, chromEnd:{$lte: 10243}} ).explain()` says? It should tell you precisely which index it used for this particular query. – Alex Blex Apr 29 '21 at 09:50
  • @AlexBlex I have added the `explain()` output, for a classic query – Benoit B. Apr 29 '21 at 10:19
  • 1
    The part `winningPlan" : { ..... "inputStage" : { "stage" : "IXSCAN", .... "indexName" : "chrom_1_chromStart_1_chromEnd_1"` says it does use the index. It is the compound one, and it supports your query the most. There is nothing to improve there. – Alex Blex Apr 29 '21 at 10:41
  • Thanks @AlexBlex - It does seems indeed. But I have a hard time believing that a query (with indexes) could take 16sec or more to return entries – Benoit B. Apr 29 '21 at 11:28
  • Is the 14Gb .bed file the whole collection or only a sample? If former it is indeed too slow. I would check system load if there are any particular bottlenecks like RAM, I/O, CPU. Then check mongodb performance, concurrent queries, etc. If working set doesn't fit into RAM you can expect some performance drop. This may help: https://stackoverflow.com/questions/6453584/what-does-it-mean-to-fit-working-set-into-ram-for-mongodb But it's a bit more generic "query optimisation" rather than "indexing" alone. – Alex Blex Apr 29 '21 at 12:16
  • You don't run it in docker do you? – Alex Blex Apr 29 '21 at 12:17
  • No docker. And yes the whole 14Go file is the whole collection (164 million entries). Running on 32Go RAM iMac, SSD HD. Nothing other concurrent queries. – Benoit B. Apr 29 '21 at 12:20

2 Answers2

0

To summarise the comments there is nothing to improve in the index. You already have the one that supports your query the most, and mongo actually uses it.

Improving performance is a bit more generic topic and is probably too wide for the SO format.

In this particular case if we can assume that chromStart is always less than chromEnd we can modify the query by adding top limit of chromStart the same as low limit of chromEnd :

db.hsap_all_peaks.find({ 
    chrom: "chr1",  
    chromStart: {$gte: 9905, $lte: 10243}, 
    chromEnd:{$lte: 10243}
} )

It will change the range in the explain() from

            "indexBounds" : {
                "chrom" : [
                    "[\"chr2\", \"chr2\"]"
                ],
                "chromStart" : [
                    "[9905.0, inf.0]"
                ],
                "chromEnd" : [
                    "[-inf.0, 10243.0]"
                ]
            }

to

            "indexBounds" : {
                "chrom" : [
                    "[\"chr2\", \"chr2\"]"
                ],
                "chromStart" : [
                    "[9905.0, 10243.0]"
                ],
                "chromEnd" : [
                    "[-inf.0, 10243.0]"
                ]
            }

The smaller the chromStart range the less nodes in the index it will need to examine.

Alex Blex
  • 34,704
  • 7
  • 48
  • 75
0

The purpose of the index is to reduce the amount of data that must be examined to process the query.

With the existing index and query, you are avoiding looking at any document that will not match the query.

The only possible improvement will be in reducing the number of index keys examined.

The way the mongod will scan the index starts from negative infinity for an $lte operator, and extends to positive infinity for a $gte operator.

The sample data appears to have the property that chromStart is strictly less than chromEnd for any single document.

If that assumption is correct, you can use that to optimize the query a little bit more by restricting the limits.

Consider the query you explained:

db.hsap_all_peaks.find({ 
              chrom: "chr2",  
              chromStart: {$gte: 50967094}, 
              chromEnd:{$lte: 50970983} 
})

As the explain command reported, the winning plan uses index bounds:

 "indexBounds" : {
                    "chrom" : [ "[\"chr2\", \"chr2\"]" ],
                    "chromStart" : ["[50967094.0, inf.0]"],
                    "chromEnd" : ["[-inf.0, 50970983.0]"]
                }

Those inf.0 mean infinity, which is probably quite a few keys for non-matching documents.

If you were to use both values for both criteria in the query, like:

db.hsap_all_peaks.find({ 
              chrom: "chr2",  
              chromStart: {$gte: 50967094, $lte: 50970983}, 
              chromEnd:{$gte: 50967094, $lte: 50970983} 
})

Those index bounds could be reduced (in theory) to:

 "indexBounds" : {
                    "chrom" : [ "[\"chr2\", \"chr2\"]" ],
                    "chromStart" : ["[50967094.0, 50970983.0]"],
                    "chromEnd" : ["[50967094.0, 50970983.0]"]
                }

In a very large data set, that could be millions of keys that no longer need to be evaluated.

Or it could be a total waste of time.

I would be very interested to hear if it actually helps.

Joe
  • 25,000
  • 3
  • 22
  • 44